Chapter 10. Query a corpus in near-natural language: A human-friendly corpus query language not only for linguists

Milička, Jiří; Šebestová, Denisa

doi:10.1075/scl.119.10mil

In: Crossing Boundaries through Corpora: Innovative corpus approaches within and beyond linguistics
Edited by Sarah Buschfeld, Patricia Ronan, Theresa Neumaier, Andreas Weilinghoff and Lisa Westermayer
[Studies in Corpus Linguistics 119] 2024
pp. 248–262

Jiří Milička | Charles University in Prague

Denisa Šebestová | Charles University in Prague

Published online: 17 October 2024

https://doi.org/10.1075/scl.119.10mil

Abstract

This paper addresses the pressing issue of making corpora accessible to users who are unable or unwilling to learn a formal query language. It introduces a working online automatic translator from near-natural language into the Corpus Query Language (CQL), as used in Sketch Engine, the Czech National Corpus web applications, and other services. The translator does not require strict syntactic patterns and tolerates a certain number of typing errors by exploiting the redundancy inherent in natural language. It allows users to query corpora of 35 languages hosted by the Czech National Corpus infrastructure, all of them annotated in the Universal Dependencies formalism. Alternatively, the translated CQL code can be used in other compatible systems. The paper presents the theoretical assumptions behind our solution and outlines the details of its implementation, including examples of use.

Keywords: corpus linguistics, corpus, query language, large language models, Universal Dependencies formalism, Czech National Corpus

Article outline

1. Introduction
2. Methodology
3. Example queries
4. Examples of use outside linguistics
5. Testing
6. Conclusion
Notes
References
