Article published in: International Journal of Corpus Linguistics
Vol. 10:3 (2005) ► pp.307–334
The advantage of using relational databases for large corpora
Speed, advanced queries, and unlimited annotation
Published online: 1 September 2005
https://doi.org/10.1075/ijcl.10.3.02dav
Relational databases can be used to create large corpora that provide both very good search performance and a wide range of queries. This paper outlines how this approach has been used to create the Corpus del Español, which contains 100 million words of Spanish text from the 1200s to the 1900s. The main databases are composed of n-gram tables (all unique 1-, 2-, 3-, and 4-word sequences) and the associated frequency of each n-gram in each century (historical Spanish) and register (Modern Spanish). These tables are then joined to other tables containing part of speech, lemma, synonyms, and user-defined lists of words and lemmas. There is essentially no limit to the amount of annotation that can be added in additional tables (with little or no impact on performance), and the SQL-based queries allow a wide range of searches that are not available with traditional corpora.
Keywords: n-grams, Spanish, historical, relational databases, SQL
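The table design described in the abstract can be illustrated with a small sketch. It is not the schema of the Corpus del Español: the table names (`lexicon`, `bigrams`), the column names, the helper functions, and the toy frequencies are all hypothetical, and SQLite stands in for whatever database engine the corpus actually uses. It shows only the general idea of joining an n-gram frequency table to an annotation table so that a query can be phrased in terms of lemma and part of speech.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with hypothetical tables and toy data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Annotation table: one row per word form (assumed layout).
        CREATE TABLE lexicon (word TEXT PRIMARY KEY, lemma TEXT, pos TEXT);
        -- 2-gram table: frequency of each word pair per century (assumed layout).
        CREATE TABLE bigrams (w1 TEXT, w2 TEXT, century INTEGER, freq INTEGER);
        CREATE INDEX idx_bigrams_w1 ON bigrams (w1);
        CREATE INDEX idx_bigrams_w2 ON bigrams (w2);
    """)
    conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?)", [
        ("casa", "casa", "noun"),
        ("casas", "casa", "noun"),
        ("blanca", "blanco", "adj"),
        ("blancas", "blanco", "adj"),
        ("grande", "grande", "adj"),
        ("la", "el", "det"),
    ])
    # Invented frequencies, for illustration only.
    conn.executemany("INSERT INTO bigrams VALUES (?, ?, ?, ?)", [
        ("casa", "blanca", 1500, 4),
        ("casa", "blanca", 1800, 9),
        ("casas", "blancas", 1800, 3),
        ("casa", "grande", 1800, 7),
        ("la", "casa", 1800, 50),
    ])
    return conn


def adjectives_after_lemma(conn, lemma, century):
    """Return (adjective lemma, summed frequency) for adjectives that
    directly follow any form of `lemma` in the given century."""
    query = """
        SELECT l2.lemma, SUM(b.freq) AS total
        FROM bigrams AS b
        JOIN lexicon AS l1 ON l1.word = b.w1
        JOIN lexicon AS l2 ON l2.word = b.w2
        WHERE l1.lemma = ? AND l2.pos = 'adj' AND b.century = ?
        GROUP BY l2.lemma
        ORDER BY total DESC, l2.lemma
    """
    return conn.execute(query, (lemma, century)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(adjectives_after_lemma(conn, "casa", 1800))
```

In this sketch, adding another layer of annotation (for example, a synonym table) would mean adding one more table and one more join, leaving the n-gram table itself unchanged.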
