CRI: A Competent Reader Imitator for detecting binomial names in an historical corpus

Morand, Clément; Ridoux, Olivier

doi:10.1075/li.00107.mor

Article published in: Lingvisticæ Investigationes
Vol. 47:1 (2024), pp. 30–67

Clément Morand | Université Paris-Saclay

Olivier Ridoux | IRISA — University of Rennes

Published online: 31 October 2024

https://doi.org/10.1075/li.00107.mor

Summary

La Nature (1873–1962) is a French popular science magazine that covered a long period and a wide range of topics. It is available as OCR-processed archives, which form a corpus that is simultaneously diachronic, heterogeneous, and noisy. Although these characteristics make it complex to analyze, La Nature is of great interest for digital humanities studies on the evolution of thought in science, technology, and even politics. The work presented in this article is part of research on the semantic annotation of these archives, which aims to discover clues for exploring them. One type of clue that has not been explored in a complex corpus such as La Nature is binomial names, or more specifically, the named entities that refer to the Linnean classification of life, e.g., Escherichia coli. To overcome this complexity, we introduce the concept of a Competent Reader, who can detect binomial names even when they are obsolete, non-standard, or defaced by OCR. Our approach, which we call the Competent Reader Imitator (CRI), imitates such a reader by combining a rule-based approach with a frequency argument. We show that this method is robust to numerous variations and consistently achieves an F-measure of about 70% despite diachronicity, heterogeneity, and noise, which are all known to impede named entity recognition. The Competent Reader imitation approach has other potential applications, such as the study of chemical names and of names of scientific and technical artifacts. Beyond our work on La Nature, we hope this paper provides a set of tools and methods that are easily understandable, frugal, and usable by a general public interested in exploring similar historical corpora.
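The combination of a rule with a frequency argument can be illustrated with a minimal sketch. This is not the authors' implementation: the pattern, the function name `detect_binomials`, and the parameter `rare_threshold` are assumptions made for illustration, and the summary does not specify the actual rules or threshold used by CRI.

```python
import re
from collections import Counter

# Assumed rule: a capitalized genus-like word followed by a lowercase
# species-like word of at least three letters (e.g. "Escherichia coli").
CANDIDATE = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{3,})\b")


def detect_binomials(text, rare_threshold=2):
    """Return candidate binomial names whose second word is rare in the text.

    Hypothetical helper: `rare_threshold` is an illustrative parameter,
    not a value taken from the article.
    """
    # Case-insensitive word frequencies over the whole text.
    freq = Counter(w.lower() for w in re.findall(r"[A-Za-z]+", text))
    found = []
    for match in CANDIDATE.finditer(text):
        genus, epithet = match.groups()
        # Assumed frequency argument: a frequent lowercase word is more
        # likely ordinary vocabulary than a species epithet, so only
        # pairs with a rare second word are kept.
        if freq[epithet] <= rare_threshold:
            found.append(f"{genus} {epithet}")
    return found


sample = "The gut hosts Escherichia coli. The gut is warm. The gut is wet."
print(detect_binomials(sample))  # prints ['Escherichia coli']
```

In this toy sample, the sentence-initial pair "The gut" matches the rule but is rejected because "gut" occurs three times, while "coli" occurs once and is kept. The sketch handles only unaccented ASCII letters and ignores OCR noise, both of which a real historical French corpus would require handling.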

Keywords: named-entity recognition, binomial names, historical corpus, digital sufficiency

Article outline

1. Introduction
2. Context information
- 2.1 The Linnean classification
- 2.2 The corpus of La Nature
  - 2.2.1 Structure of the corpus
  - 2.2.2 Volumetry of the corpus
3. Creation of a gold standard
- 3.1 The Competent Reader Hypothesis (CRH)
- 3.2 The task
- 3.3 The manual annotation
4. State of the art and baseline
- 4.1 State of the art
- 4.2 LINNAEUS
- 4.3 TAXREF classifier
- 4.4 QUAESITOR
5. A Competent Reader Imitator
- 5.1 Latin
- 5.2 A Rare-Frequent Threshold
- 5.3 Handling deviations from the code
6. Evaluation
- 6.1 Comparison with the state of the art
- 6.2 Ex-post evaluation
- 6.3 Sensitivity to accepted patterns
- 6.4 Sensitivity to thematic variability
7. Conclusion & future work
- 7.1 Carbon impact
- 7.2 Conclusion / discussion
- 7.3 Future work
References
