In: Crossing Boundaries through Corpora: Innovative corpus approaches within and beyond linguistics
Edited by Sarah Buschfeld, Patricia Ronan, Theresa Neumaier, Andreas Weilinghoff and Lisa Westermayer
[Studies in Corpus Linguistics 119] 2024
pp. 248–262
Chapter 10. Query a corpus in near-natural language
A human-friendly corpus query language not only for linguists
Published online: 17 October 2024
https://doi.org/10.1075/scl.119.10mil
Abstract
This paper addresses the pressing issue of making corpora accessible to users who are unable or unwilling to
learn a formal query language. It introduces a working online automatic translator from a near-natural language into
the Corpus Query Language (CQL), as used in Sketch Engine, the Czech National Corpus web
applications, and other services. The translator does not require strict syntactic patterns and tolerates a certain
number of typing errors by exploiting the redundancy inherent in natural language. It allows querying corpora in 35
languages hosted by the Czech National Corpus infrastructure, all of them annotated in the Universal
Dependencies formalism. Alternatively, the translated CQL code can be used in other compatible systems. The paper
presents the theoretical assumptions behind our solution and outlines the details of its implementation, including
examples of use.
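The idea of translating loosely phrased input into CQL can be illustrated with a minimal sketch. This is not the translator described in the chapter: the vocabulary, the filler-word list, the `translate` function, and the attribute names `upos` and `lemma` are all assumptions made for illustration, and the typo tolerance simply relies on fuzzy string matching from Python's standard library.

```python
import difflib

# Hypothetical vocabulary mapping near-natural words to Universal Dependencies
# UPOS tags. The CQL attribute name "upos" is an assumption about the corpus.
POS_WORDS = {
    "noun": "NOUN",
    "verb": "VERB",
    "adjective": "ADJ",
    "adverb": "ADV",
    "preposition": "ADP",
    "pronoun": "PRON",
}

# Redundant words that carry no query meaning in this toy grammar.
FILLERS = {"a", "an", "the", "followed", "by", "then", "and", "any"}


def translate(query: str) -> str:
    """Translate a near-natural query into a CQL token sequence (toy sketch)."""
    tokens = []
    for raw in query.split():
        # A double-quoted word is treated as a literal lemma.
        if raw.startswith('"') and raw.endswith('"') and len(raw) > 2:
            tokens.append(f"[lemma={raw}]")
            continue
        word = raw.lower().strip(",.;")
        if word in FILLERS:
            continue
        # Fuzzy matching lets slightly misspelled category names still resolve.
        match = difflib.get_close_matches(word, list(POS_WORDS), n=1, cutoff=0.75)
        if match:
            tokens.append(f'[upos="{POS_WORDS[match[0]]}"]')
        # Unrecognised words are silently ignored in this sketch.
    return " ".join(tokens)
```

Under these assumptions, `translate("an adjectiv followed by a noun")` yields `[upos="ADJ"] [upos="NOUN"]` despite the misspelling, and `translate('"house" then verb')` yields `[lemma="house"] [upos="VERB"]`. A real system would need a far richer grammar and would have to report unrecognised input rather than drop it.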
Article outline
- 1. Introduction
- 2. Methodology
- 3. Example queries
- 4. Examples of use outside linguistics
- 5. Testing
- 6. Conclusion
