In: Language and Text: Data, models, information and applications
Edited by Adam Pawłowski, Jan Mačutek, Sheila Embleton and George Mikros
[Current Issues in Linguistic Theory 356] 2021
pp. 239–256
Quantitative analysis of bibliographic corpora
Statistical features, semantic profiles, word spectra
Published online: 22 December 2021
https://doi.org/10.1075/cilt.356.16paw
Abstract
This chapter analyses a bibliographic corpus built from the Polish national bibliography for the period 1801–2019. The study identifies the quantitative characteristics of the bibliographic corpus and compares them with those of a reference corpus of general language. The two corpora differ significantly, in particular in the shares of individual parts of speech and in the frequency distribution of lexemes. The statistical distributions of word spectra were also studied: the best fit was obtained for the generalized inverse Gauss-Poisson and Zipf-Mandelbrot distributions, and the parameters of both distributions also differ between the bibliographic and reference corpora. Beyond quantitative linguistics, the most promising directions for future research on bibliographic corpora are semantic analysis and text mining.
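To make the notion of a word spectrum concrete, the following is a minimal Python sketch and not the authors' code (the chapter's source list points to the ZipfR package for R). It counts how many distinct word types occur exactly m times and gives the Zipf-Mandelbrot rank-frequency formula in its common form f(r) = c / (r + b)^a. The function names, the parameter names `a`, `b`, `c`, and the toy token list are assumptions made for illustration only.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Return {m: V_m}, where V_m is the number of distinct types
    that occur exactly m times in the token sequence."""
    type_freqs = Counter(tokens)             # type -> frequency
    spectrum = Counter(type_freqs.values())  # frequency m -> number of types
    return dict(sorted(spectrum.items()))


def zipf_mandelbrot(rank, a, b, c):
    """Zipf-Mandelbrot rank-frequency law: f(r) = c / (r + b) ** a.
    The parameter names are illustrative, not taken from the chapter."""
    return c / (rank + b) ** a


# Hypothetical toy data: 'a' occurs 3 times, 'b' twice, 'c' and 'd' once.
tokens = "a b a c a b d".split()
spectrum = frequency_spectrum(tokens)
print(spectrum)
```

Summing m * V_m over the spectrum recovers the number of tokens, which is a quick sanity check before fitting any distribution to the spectrum.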
Article outline
- 1. Large-scale bibliographies as text corpora
- 2. Data and hypotheses
- 3. Research method
- 4. Results: An overview
- 5. Results: Statistical distributions
- 6. Conclusions
References (21)
BN Data: [URL]
CLARIN-PL infrastructure: [URL]
NKJP: [URL]
WCRFT2 morphosyntactic tagger: [URL]
ZipfR package: [URL]
CHBB. 1999–2019. The Cambridge history of the book in Britain. Volumes 1–7. Cambridge: Cambridge University Press.
Cressie, Noel & Timothy R. C. Read. 1984. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society. Series B (Methodological) 46(3). 440–464. [URL].
Evert, Stefan & Marco Baroni. 2007. ZipfR: Word frequency distributions in R. In Sophia Ananiadou (ed.), Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Posters and Demonstrations Session, 29–32. Prague: Association for Computational Linguistics. [URL] (9 September, 2020.)
Green, Jonathan, Frank McIntyre & Paul Needham. 2011. The shape of incunable survival and statistical estimation of lost editions. The Papers of the Bibliographical Society of America 105(2). 141–175.
Grotjahn, Rüdiger & Gabriel Altmann. 1993. Modelling the distribution of word length: Some methodological problems. In Reinhard Köhler & Burghard B. Rieger (eds.), Contributions to quantitative linguistics, 141–153. Dordrecht: Kluwer.
Lahti, Leo, Jani Marjanen, Hege Roivainen & Mikko Tolonen. 2019. Bibliographic data science and the history of the book (c. 1500–1800). Cataloguing & Classification Quarterly 57(1). 5–23.
Mačutek, Ján & Gejza Wimmer. 2013. Evaluating goodness-of-fit of discrete distribution models in quantitative linguistics. Journal of Quantitative Linguistics 20(3). 227–240.
Mandelbrot, Benoît. 1962. On the theory of word frequencies and on related Markovian models of discourse. In Roman Jakobson (ed.), Structure of Language and its Mathematical Aspects (Proceedings of Symposia in Applied Mathematics 12), 190–219. Providence, RI: AMS.
Schwetschke, Gustav. 1850. Codex nundinarius Germaniae literatae bisecularis. Teil: 1564 – 1765. Halle: G. Schwetschke’s Verlags-Handlung und Buchdruckerei. [URL] (9 September, 2020.)
Schwetschke, Gustav. 1877. Codex nundinarius Germaniae literatae bisecularis. Teil: Forts. 1766 bis 1846. Halle: G. Schwetschke’s Verlags-Handlung und Buchdruckerei. [URL] (9 September, 2020.)
Sichel, Herbert S. 1975. On a distribution law for word frequencies. Journal of the American Statistical Association 70. 542–547.
Sichel, Herbert S. 1982. Asymptotic efficiency of the three methods of estimation for the inverse Gaussian-Poisson distribution. Biometrika 69. 467–472.
Tolonen, Mikko, Jani Marjanen, Hege Roivainen & Leo Lahti. 2019a. Quantitative approach to book-printing in Sweden and Finland, 1640–1828. Historical Methods: A Journal of Quantitative and Interdisciplinary History 52(1). 57–78.
Tolonen, Mikko, Jani Marjanen, Hege Roivainen & Leo Lahti. 2019b. Scaling up bibliographic data science. In Costanza Navarretta, Manex Agirrezabal & Bente Maegaard (eds.), Proceedings of the Digital Humanities in the Nordic Countries 4th Conference, 450–456. Copenhagen: University of Copenhagen. [URL] (9 September, 2020.)
Cited by (2)
Pawłowski, Adam, Tomasz Walkowiak & Lars G. Johnsen
2025. Linguistic correlates of semantic knowledge ontologies. In Mathematical Modelling in Linguistics and Text Analysis [Current Issues in Linguistic Theory, 370], pp. 161 ff.
