Article published in: Lingvisticæ Investigationes
Vol. 47:1 (2024) ► pp.30–67
CRI: A Competent Reader Imitator for detecting binomial names in an historical corpus
Published online: 31 October 2024
https://doi.org/10.1075/li.00107.mor
Summary
La Nature (1873–1962) is a French popular science magazine that spanned a long time period and a
wide range of topics. It is available as OCR-processed archives, which form a corpus that is simultaneously diachronic,
heterogeneous, and noisy. Although these characteristics make it complex to analyze, La Nature is of great
interest for digital humanities studies on the evolution of thought in science, technology, and even politics.
The work presented in this article is part of research on the semantic annotation of these archives, which aims to discover
clues for exploring them. One type of clue that has not been explored in a complex corpus such as La Nature is
binomial names, or more specifically, the named entities that refer to the Linnean
classification of life, e.g., Escherichia coli. To overcome this complexity, we introduce the concept of a Competent
Reader, who can detect binomial names even when they are obsolete, non-standard, or defaced by OCR. Our approach,
which we call the Competent Reader Imitator (CRI), imitates such a reader by combining a
rule-based approach with a frequency argument. We show that this innovative method is robust to numerous variations and
consistently achieves an F-measure of about 70% despite diachronicity, heterogeneity, and noise, which are all known to impede
named entity recognition. The Competent Reader imitation approach has many potential applications, such as the study of
chemical names and of names of scientific and technical artifacts. Beyond our work on
La Nature, we hope this paper provides a set of tools and methods that are easily understandable, frugal, and
usable by a general public interested in exploring similar historical corpora.
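As a rough illustration of what combining a rule with a frequency argument can look like, here is a minimal Python sketch. It is not the authors' CRI implementation: the surface pattern, the function names, the sample text, and the threshold value are all assumptions made for this example. The rule proposes "capitalized word followed by lowercase word" as a candidate binomial name, and the frequency step discards candidates whose words are common in the corpus, on the assumption that ordinary vocabulary is frequent while Latin name parts are rare.

```python
import re
from collections import Counter

# Hypothetical surface rule: a capitalized word (genus-like) followed by a
# lowercase word (epithet-like). ASCII only; real French or OCR-damaged text
# would need a broader character class.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def word_frequencies(text):
    """Count case-folded word occurrences in the text."""
    return Counter(w.lower() for w in re.findall(r"[A-Za-z]+", text))


def candidate_binomials(text):
    """Return every (genus, epithet) pair matching the surface rule."""
    return BINOMIAL.findall(text)


def filter_by_rarity(candidates, counts, threshold):
    """Keep candidates whose two words both occur at most `threshold` times.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [
        (genus, epithet)
        for genus, epithet in candidates
        if counts[genus.lower()] <= threshold and counts[epithet] <= threshold
    ]


# Invented sample text (accents omitted to fit the ASCII-only rule).
sample = (
    "Les colonies de Escherichia coli sont communes. "
    "Les colonies sont observees dans les eaux. "
    "Les eaux sont les memes."
)

counts = word_frequencies(sample)
candidates = candidate_binomials(sample)
result = filter_by_rarity(candidates, counts, threshold=1)
print(result)
```

In this toy text the rule alone also proposes pairs such as "Les colonies", and only the frequency step removes them. A real corpus would need a threshold tuned on annotated data.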
Article outline
- 1. Introduction
- 2. Context information
- 2.1 The Linnean classification
- 2.2 The corpus of La Nature
- 2.2.1 Structure of the corpus
- 2.2.2 Volumetry of the corpus
- 3. Creation of a gold standard
- 3.1 The Competent Reader Hypothesis (CRH)
- 3.2 The task
- 3.3 The manual annotation
- 4. State of the art and baseline
- 4.1 State of the art
- 4.2 LINNAEUS
- 4.3 TAXREF classifier
- 4.4 QUAESITOR
- 5. A Competent Reader Imitator
- 5.1 Latin
- 5.2 A Rare-Frequent Threshold
- 5.3 Handling deviations from the code
- 6. Evaluation
- 6.1 Comparison with the state of the art
- 6.2 Ex-post evaluation
- 6.3 Sensitivity to accepted patterns
- 6.4 Sensitivity to thematic variability
- 7. Conclusion & future work
- 7.1 Carbon impact
- 7.2 Conclusion / discussion
- 7.3 Future work
