LwM Digital Residency: Unlocking the Past: Structured Data Extraction in 19th Century Chilean Newspapers

Written by Jennifer Hayward and the teamSeptember 19, 2023Comments: 0

Introduction

Despite the increased pace of digitization, providing access to the data embedded within historical archives remains a challenge, particularly when dealing with the degraded and inconsistent typefaces in century-old newspapers. Enter the Living with Machines Digital Residency. A cross-disciplinary project aimed at extracting structured data from 19th-century Chilean newspapers, this effort ushers in a new frontier for the Digital Humanities.

The Team and Mission

Our interdisciplinary team included:

  • Jennifer Hayward and Michelle Prain Brice (Department of Literature, Universidad Adólfo Ibáñez, Chile
  • Leonor Riesco (Department of History, Universidad Fines Terrae, Chile)
  • Gillian Valenzuela (Artificial Intelligence program, Universidad Adólfo Ibáñez, Chile)
  • Khandokar Shakib (PhD candidate, Computer Science program, Tulane University, USA).

This collaboration, guided by Kaspar Beelen from the Living with Machines project, combined expertise from literature, history, data science, and artificial intelligence. We sought to develop innovative techniques for data extraction; our goals were to increase the accessibility of these historical documents, provide structured economic and historical data to researchers, and develop a pipeline to train Chilean students in digital methods.

Image source: AnglophoneChile Project, “The Star of Chile No. 17” (26 November 1904), 2019. CC BY NC 4.0

Methodology: A Two-Pronged Approach

As Living with Machines researchers know all too well, 19th-century newspapers present several challenges when it comes to OCR. The most significant of these are the inconsistent and degraded typeface as well as the highly segmented and inconsistent formats. This made the extraction of structured data, such as headlines, articles, and images, a major hurdle.

Luckily for us, a team led by Kaspar Beelen had already run extensive experiments on best methods for extracting structured data from a complex format. Kaspar led us through the Press Directories project’s structured text extraction methods, highlighting the techniques that had the best potential to adapt for our digital archive of 19th-century English-language newspapers published in Chile. With a focus on overcoming the challenges presented by our digital data through an innovative pipeline, our project incorporated Optical Character Recognition (OCR) techniques, machine learning, and data cleanup methods like these:

  • Khandokar Shakib explored template matching and data extraction using NLP pipelines.
  • Gillian Valenzuela delved into recent machine-learning models like OpenAI’s LLM to refine the text extraction process.
  • Given the inaccuracies yielded by the extraction process, we also performed rigorous data cleanup to ensure the integrity of the extracted data, prepping it for further analysis.

Results: Success and Future Directions

Our efforts led to two successful pipelines that could automatically extract specific sections and tables from the historical newspapers. Our next steps include integrating more advanced computational models, thereby making the pipeline more accurate and efficient, and using the pipelines we’ve developed to extract economic information of great value to historians of British and global informal empire.

The dataset we’ve generated opens up new interpretative avenues for researchers—such as evaluating trade commodities, assessing economic value, and mapping out shipping routes—offering a trove of insights into the political, social, and economic dynamics of the 19th century.

Acknowledgments

A big shout-out to the Living with Machines Digital Residencies program for their incredible support, and to our talented collaborators for their invaluable contributions to the project!

By delving into the past with cutting-edge technology, we’re not just preserving history; we’re making it. Explore our GitHub repositories to learn more about our methods and check out our website, AnglophoneChile.org, and our digital archive, news-archive.anglophonechile.org.



Our Funder and Partners