Chapter 10. Challenges and strategies for beginners to solve research questions with DH methodologies on a corpus of multilingual Philippine periodicals

Ortuño Casanova, Rocío

doi:10.1075/btl.155.10ort

In: Literary Translation in Periodicals: Methodological challenges for a transnational approach
Edited by Laura Fólica, Diana Roig-Sanz and Stefania Caristia
[Benjamins Translation Library 155] 2020
► pp. 247–272

Rocío Ortuño Casanova | University of Antwerp

Available under the Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND) 4.0 license.

For any use beyond this license, please contact the publisher at rights@benjamins.nl.

Published online: 10 December 2020

https://doi.org/10.1075/btl.155.10ort

Abstract

A frequently cited problem in Digital Humanities (DH) is the difficult fit between Humanities research questions and DH methodologies. This chapter is therefore conceived as a meta-chapter that explains the problems encountered and the strategies adopted when exploring the multilingual repository of Philippine periodicals built within the project “Strengthening Digital Research at the UP System” in order to study how the image of China evolved in these periodicals. The two main challenges in analysing the periodicals have been (1) problematic OCR and (2) research across multilingual publications. The chapter surveys literature and research projects that have addressed similar questions and challenges in comparable corpora, and suggests some tools for tackling them.

Keywords: Philippine rare periodicals, multilingual text analysis, representation, low-resource languages, OCR, online repository, challenges in digital humanities

Article outline

The PhilPeriodicals project
The research question
Approaches to studying a country’s representation in the periodical press
First difficulty: How to prepare a set of plurilingual texts?
- The problem with OCR
- Translation
Tools
What would researchers in the humanities need from a periodicals repository in the 21st century?
Notes
References

Cited by one other publication

Roig-Sanz, Diana & Laura Fólica

2021. Big translation history. Translation Spaces 10:2 ► pp. 231 ff.

This list is based on CrossRef data as of 3 December 2025 and may not be complete. The sources presented here were supplied by the respective publishers, to whom any errors should be reported.
