In: Recent Advances in Computational Terminology
Edited by Didier Bourigault, Christian Jacquemin and Marie-Claude L'Homme
[Natural Language Processing 2] 2001
pp. 209–224
Extracting useful terms from parenthetical expressions by combining simple rules and statistical measures
A comparative evaluation of bigram statistics
Published online: 15 June 2001
https://doi.org/10.1075/nlp.2.11his
One year’s worth of Japanese newspaper articles contains about 300,000 ‘parenthetical expressions (PEs)’, pairs of character strings A and B related to each other by parentheses as in A(B). These expressions contain a large number of important terms, such as organization names, company names, and their abbreviations, and are easily extracted by pattern matching.
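The pattern-matching step can be illustrated with a minimal sketch. It is not the chapter's actual extraction rule: the regular expression, the choice of boundary characters for A, and the name `extract_pes` are all assumptions made for illustration. It accepts both ASCII and full-width parentheses, since Japanese text commonly uses the latter.

```python
import re

# Assumption: A is the run of characters immediately before the opening
# parenthesis, bounded by whitespace, common punctuation, or another
# parenthesis. B is everything between the parentheses (no nesting).
PE_PATTERN = re.compile(r"([^\s()（）、。,.]+)[(（]([^()（）]+)[)）]")


def extract_pes(text):
    """Return a list of (A, B) pairs for every A(B) pattern found in text."""
    return [(m.group(1), m.group(2)) for m in PE_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = "昨日、国際連合（国連）は声明を出した。"
    for outer, inner in extract_pes(sample):
        print(outer, inner)
```

Deciding where A really begins is the hard part in practice; this sketch simply stops at the nearest punctuation or whitespace, which a real system would need to refine.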
We have developed a simple and accurate method for collecting unregistered terms from PEs. Using pattern matching, bigram statistics, and entropy, the method identified two types of PEs and collected about 17,000 terms with over 97% precision.
Bigram statistics, combined with a small number of rules, identified ‘pairs of exchangeable terms’ (PETs) in PEs, which mostly contained important terms and their abbreviations. Entropy highlighted inner PE terms (for example, a term meaning ‘company personnel affairs’) that served as useful clues for acquiring proper nouns such as company names, organization names, and person names.
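One plausible reading of the entropy step is sketched below. The abstract does not say over which distribution entropy is computed, so this is an assumption: for each inner string B, the sketch takes the Shannon entropy of the outer strings A that appear with it. An inner term that attaches to many different outer strings (as a generic label would) scores high, while an abbreviation tied to a single full form scores zero. The function name `inner_term_entropy` is hypothetical.

```python
import math
from collections import Counter, defaultdict


def inner_term_entropy(pairs):
    """Shannon entropy (in bits) of the outer strings A seen with each inner string B.

    pairs: iterable of (A, B) tuples taken from A(B) expressions.
    """
    outer_counts = defaultdict(Counter)
    for outer, inner in pairs:
        outer_counts[inner][outer] += 1

    entropy = {}
    for inner, counts in outer_counts.items():
        total = sum(counts.values())
        entropy[inner] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    pairs = [
        ("X Corp", "personnel"), ("Y Corp", "personnel"),
        ("Z Corp", "personnel"), ("W Corp", "personnel"),
        ("United Nations", "UN"), ("United Nations", "UN"),
    ]
    for inner, h in inner_term_entropy(pairs).items():
        print(inner, h)
```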
Identification of PETs provided the opportunity to evaluate the usefulness of various bigram co-occurrence statistics. Seven statistical measures (frequency, Mutual Information, the χ²-test, the χ²-test with Yates’ correction, the log-likelihood ratio, the Dice coefficient, and the modified Dice coefficient) were compared.
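The seven measures can all be computed from a 2×2 contingency table of co-occurrence counts, as in the following sketch. It uses the standard textbook definitions, which may differ in detail from those used in the chapter. In particular, the abstract does not define the modified Dice coefficient; the form below (Dice weighted by log2 of the joint frequency) is an assumption. The function name and parameters are hypothetical, and the sketch assumes 0 < f_ab ≤ min(f_a, f_b) and f_a, f_b < n so that no marginal is zero.

```python
import math


def bigram_measures(f_ab, f_a, f_b, n):
    """Association scores for a pair (A, B).

    f_ab: number of times A and B occur together
    f_a, f_b: total occurrences of A and of B
    n: total number of observed pairs
    """
    # Observed 2x2 contingency table.
    n11 = f_ab
    n12 = f_a - f_ab
    n21 = f_b - f_ab
    n22 = n - f_a - f_b + f_ab
    rows = (f_a, n - f_a)
    cols = (f_b, n - f_b)
    denom = rows[0] * rows[1] * cols[0] * cols[1]
    diff = n11 * n22 - n12 * n21

    # Log-likelihood ratio: 2 * sum(O * ln(O / E)), with 0 * ln(0) taken as 0.
    observed = ((n11, n12), (n21, n22))
    llr = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = rows[i] * cols[j] / n
            if o > 0:
                llr += o * math.log(o / e)
    llr *= 2

    dice = 2 * f_ab / (f_a + f_b)
    return {
        "frequency": f_ab,
        "mutual_information": math.log2(f_ab * n / (f_a * f_b)),
        "chi2": n * diff ** 2 / denom,
        "chi2_yates": n * (abs(diff) - n / 2) ** 2 / denom,
        "log_likelihood": llr,
        "dice": dice,
        # Assumed form of the modified Dice coefficient (see lead-in).
        "modified_dice": math.log2(f_ab) * dice,
    }


if __name__ == "__main__":
    for name, score in bigram_measures(f_ab=10, f_a=20, f_b=20, n=100).items():
        print(name, score)
```

Ranking candidate pairs by each score in turn, then checking the top of each ranking against known PETs, is one way such measures could be compared.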
