Open-source Corpora: Using the net to fish for linguistic data

Sharoff, Serge

doi:10.1075/ijcl.11.4.05sha

Article published in: International Journal of Corpus Linguistics
Vol. 11:4 (2006), pp. 435–462

Serge Sharoff | University of Leeds

Published online: 8 December 2006

https://doi.org/10.1075/ijcl.11.4.05sha

The paper proposes a methodology for collecting “open-source” corpora, i.e. corpora that are automatically collected from the Internet and distributed in the form of a list of links with open-source software for recreating their full text. The result is a random snapshot of Internet pages which contain stretches of connected text in a given language. The paper discusses a methodology for acquiring such corpora, two ways of documenting them (using a set of metatextual categories and by comparison to frequency lists from existing corpora) and their function as benchmarks for comparing results of linguistic inquiry. Experiments with a variety of languages show that Internet-derived corpora can be successfully used in the absence of large representative corpora, which are rare and expensive to build.

Keywords: representative corpora, frequency lists, Internet, corpus composition
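
As an illustration of the second documentation method mentioned in the abstract (comparison to frequency lists from existing corpora), the following is a minimal sketch rather than the paper's own procedure. The abstract does not name a comparison statistic; the log-likelihood (G2) score used here is an assumption, as are the naive whitespace tokenizer and the function names `frequency_list`, `log_likelihood` and `compare`.

```python
import math
from collections import Counter


def frequency_list(text):
    """Build a word-frequency list from raw text.

    Assumption: lowercased whitespace tokens with surrounding
    punctuation stripped stand in for real tokenization.
    """
    tokens = (tok.strip(".,;:!?\"'()[]") for tok in text.lower().split())
    return Counter(tok for tok in tokens if tok)


def log_likelihood(a, b, total_a, total_b):
    """G2 score for a word seen a times in corpus A and b times in corpus B."""
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_a)
    if b > 0:
        score += b * math.log(b / expected_b)
    return 2 * score


def compare(web_freq, ref_freq):
    """Rank words by how strongly their frequency differs between two lists."""
    total_web = sum(web_freq.values())
    total_ref = sum(ref_freq.values())
    scored = []
    for word in set(web_freq) | set(ref_freq):
        # Counter returns 0 for words missing from one of the lists.
        a, b = web_freq[word], ref_freq[word]
        scored.append((log_likelihood(a, b, total_web, total_ref), word))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    web = frequency_list("click here to download the page, click here")
    ref = frequency_list("the minister said the report was published")
    for score, word in compare(web, ref)[:5]:
        print(f"{word}\t{score:.2f}")
```

The score is unsigned: it shows which words differ most between the two lists, not which list over-represents them. A real comparison would need proper tokenization and lemmatization for the language in question.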
