Lexical segments in text

Berber Sardinha, Tony

doi:10.1075/z.107.11ber

Editors’ introduction

Berber Sardinha’s paper deals with a problem, namely text segmentation, which connects at several points with those of the other contributors to this volume. Like Scott, Sinclair and Coulthard, Berber Sardinha is interested in understanding the computer’s understanding of text, or rather the computer’s failure to handle the complexities of text satisfactorily. Like the other contributors who have been influenced by Hoey’s work on text patterning, his work is also concerned with the problem of identifying the stages which a text goes through as it moves from one component of a pattern to the next.

The problem is not trivial. Computer methods for processing text have already led to an explosion of text retrieval methods which anyone who uses Internet search engines knows, needs and curses. That is, a fairly simple technology is there to help us find all instances of a desired word or phrase in a database, or in the whole Internet, or on a given computer. The uses to which this technology can be put are twofold: text retrieval (finding the text one is searching for) and pedagogy (learning about word collocation and colligation). But as Sinclair’s paper shows, such a technology may be efficient in its own terms but disconnected from the way human users relate to the world and to each other. Thus, a very large number of irrelevant hits are typically found, which usually hinder text retrieval as much as they help it and may also obscure and frustrate collocational inference.

It is likely that these problems will be best tackled by refinements to the techniques used, refinements which are very likely to involve questions central to the rest of this volume, concerning the aboutness of individual text segments, and the relations between text segments or elements. Thus, for information retrieval and language learning we certainly need to know much more than “which texts contain word x or phrase y?” and move towards “which texts are about z?” and “which segments of which texts are about p and not q?” and “where does the text change from explaining r to evaluating it?”. It is probable that as we learn to answer questions such as these, we shall be that much nearer to a truly useful text retrieval.

Berber Sardinha’s paper proposes a detailed and ingenious method for getting at the boundaries within a text, identifying its segments in the sense of changes in aboutness.

As with the other contributors using computer methods, the problems are as yet greater than the solutions encountered. It is therefore important to view the method being proposed in the right light: the purpose here, as in so much else, is to model the world; it is through insights arising from model-making, model application and model-testing that progress is eventually made.
