The paper proposes a methodology for collecting “open-source” corpora, i.e. corpora that are automatically collected from the Internet and distributed as a list of links together with open-source software for recreating their full text. The result is a random snapshot of Internet pages that contain stretches of connected text in a given language. The paper discusses the acquisition procedure, two ways of documenting such corpora (by a set of metatextual categories and by comparison with frequency lists from existing corpora), and their function as benchmarks for comparing the results of linguistic inquiry. Experiments with a variety of languages show that Internet-derived corpora can be used successfully where large representative corpora, which are rare and expensive to build, are not available.