Creating and using Web corpora

Thelwall, Mike

doi:10.1075/ijcl.10.4.07the

Article published in: International Journal of Corpus Linguistics
Vol. 10:4 (2005) ► pp. 517–541

Mike Thelwall | University of Wolverhampton

Published online: 7 November 2005

https://doi.org/10.1075/ijcl.10.4.07the

The Web has recently been used as a corpus for linguistic investigations, often with the help of a commercial search engine. We discuss some potential problems with collecting data from commercial search engines and with using the Web as a corpus. We outline an alternative strategy for data collection, using a personal Web crawler. As a case study, the university Web sites of three nations (Australia, New Zealand and the UK) were crawled. The most frequent words were broadly consistent with non-Web written English, but with some academic-related words amongst the 50 most frequent. It was also evident that the university Web sites contained a significant amount of non-English text, and academic Web English seems to be more future-oriented than British National Corpus written English.

Keywords: academic language, web corpus, web

Cited by five other publications

Canan Hänsel, Eva & Dagmar Deuber

2013. Globalization, postcolonial Englishes, and the English language press in Kenya, Singapore, and Trinidad and Tobago. World Englishes 32:3 ► pp. 338 ff.

Perelmutter, Renee

2012. Interactive properties: Modern Russian predicate adjectives in affirmative and negative contexts. Russian Linguistics 36:1 ► pp. 65 ff.

Koteyko, Nelya

2010. Mining the internet for linguistic and social data: An analysis of ‘carbon compounds’ in Web feeds. Discourse & Society 21:6 ► pp. 655 ff.

Baroni, Marco, Silvia Bernardini, Adriano Ferraresi & Eros Zanchetta

2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation 43:3 ► pp. 209 ff.

Nebot, Esther Monzó

2008. Corpus-based Activities in Legal Translator Training. The Interpreter and Translator Trainer 2:2 ► pp. 221 ff.

This list is based on CrossRef data as of 12 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.