Corpus analysis
Roughly, the data available for linguistic research stem from one of two sources: intuitions about language or observations of linguistic events. Collections of data of the latter kind are called corpora. Although corpus data have been used throughout the history of linguistic research, a real breakthrough in their use came in the course of the 20th century, when it became possible to store and search large quantities of text electronically. In the second half of that century, the use of corpus data in their new form was stimulated by the dissatisfaction some linguists felt with the mainstream preference for intuitive data. Positions on the appropriateness of corpus data or intuitive data for linguistic research have occasionally been quite extreme, but the best policy for any linguist is probably to regard the two as complementary rather than opposed.
It must be borne in mind, however, that corpus data reflect what people actually say and write, and as such provide the most appropriate data for linguists who want to investigate the use of language rather than linguistic competence or linguistic universals. The study of language use is concerned not only with describing what people actually say and write, but also with the question of why, in a given verbal or situational context, they use one linguistic construct rather than another. It follows that a collection of linguistic events has to meet at least two conditions to be a corpus. The first is that it should present a faithful record of the utterances contained in running texts (rather than, say, a collection of examples of a particular linguistic phenomenon). The second is that it should give information about by whom, where, when and why the texts were produced.
In other words, apart from a record of utterances, a corpus should contain the fullest possible information about the verbal and situational contexts in which the utterances were produced. The fact that corpora are repositories of language use entails that corpus-based studies are naturally biased towards the study of specific languages, genres and language varieties.
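The two conditions can be made concrete with a small data-structure sketch in Python. It is only an illustration under stated assumptions: the names CorpusText, missing_context and meets_minimal_conditions are hypothetical, and treating all four contextual fields as mandatory is a simplification of the requirement that context be recorded as fully as possible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CorpusText:
    """One running text plus the contextual information recorded about it.

    Hypothetical structure for illustration; field names are assumptions.
    """
    utterances: list[str]            # faithful record of the running text, in order
    producer: Optional[str] = None   # by whom the text was produced
    place: Optional[str] = None      # where
    date: Optional[str] = None       # when
    purpose: Optional[str] = None    # why


def missing_context(text: CorpusText) -> list[str]:
    """Return the names of the contextual fields that are not filled in."""
    fields = {
        "producer": text.producer,
        "place": text.place,
        "date": text.date,
        "purpose": text.purpose,
    }
    return [name for name, value in fields.items() if not value]


def meets_minimal_conditions(text: CorpusText) -> bool:
    """Check both conditions: a non-empty utterance record and full context.

    Requiring every field is an assumption made to keep the sketch simple.
    """
    return bool(text.utterances) and not missing_context(text)
```

A text stored with its utterances but without place, date or purpose would be reported by missing_context as lacking those three fields, which mirrors the point that a bare collection of utterances does not yet amount to a corpus.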