Over half a million pages of historical newspapers now openly available

Written by Giorgia Tolfo, March 7, 2023

Today the British Library and the Living with Machines project released the text of 50 digitised historical newspaper titles on the British Library’s research repository.

The datasets contain out-of-copyright volumes up to 1902 in a specialist format called METS/ALTO XML. They are supplemented by additional ‘free to view’ records in the British Newspaper Archive.

We look forward to seeing what you do with the data! Please get in touch at digitalresearch@bl.uk if you have questions or want to share how you used the newspapers.

This blog post tells you how we created this newspaper collection and where you can find it.

Background

From the project’s start, our goals included digitising historical newspapers held at the British Library to support research into the long nineteenth century (c. 1780s-1920), prioritising industrial areas that were under-represented in the British Newspaper Archive. After years of imaging, post-capture processing, ingest and rights clearance, our work is complete.

Selecting titles for digitisation

The British Library collection contains over 60 million individual newspaper issues. With such a vast collection and such open research questions, selecting the right papers for digitisation was not an easy task.

We’ve described some of the practical and intellectual challenges of selecting newspaper titles in our post about ‘Press Picker: visualising formats and title name changes in the British Library’s newspaper holdings’.

Close view of a column of ads in a historical newspaper, bound into a volume — Newspapers bound into volumes can be harder to digitise

To speed up the process, we prioritised longer and more consistent runs, and titles with more microfilm available, over shorter runs and titles held only in physical form.

Scanning speed, quality and cost depend on the source format. Digitising from microfilm is quicker and cheaper, but the final image can be of lower quality, both because the acetate may have deteriorated and because the original microfilm image is itself of lower quality. Physical copies, on the other hand, are more complex and delicate to handle. In many cases the volumes holding the papers needed extra conservation assessment, or had tight bindings that needed loosening, work that can add delays and increase costs. Prioritising microfilm, although less preferable from a curatorial point of view, allowed us to get the results to our researchers sooner.

Luckily the microfilms were all in good condition, except for some reels which were flagged in the Press Picker and discarded at the time of selection.

Scanning

Photo of a worker bent over a table with a large volume of newspapers under a scanner top

After copyright assessment, our list was sent to the British Library’s Imaging Service Team, who retrieved the microfilms and bound physical volumes and proceeded with scanning.

During scanning, each newspaper run was divided into years – the basic unit of organisation – and issues. The frequency of issues, the number of pages and the length of the run vary by newspaper title, which makes it difficult to assess the number of pages in each title in advance. This was a challenge for budgeting and planning timelines.

Sources that help calculate the number of pages in a title are the Waterloo Directories and original 19th century Press Directories. Mitchell’s Press Directories (download link) are particularly important, as they also provide further information such as the political leaning, price and circulation of each newspaper, as well as the names of its owners.

Once scanned and quality checked, each newspaper run was delivered to Findmypast for further processing, including optical character and layout recognition (OCR and OLR) and packaging. Each year of publication of a specific newspaper is grouped into a zip file with a folder per issue and, within each folder, an ALTO XML file per page and a METS XML file as a metadata wrapper.
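To make the package layout described above more concrete, here is a minimal Python sketch that opens one yearly zip file and groups its XML files by issue folder. It is an illustration only, not part of the project’s own tooling. The function names are hypothetical, and the rule used to tell the METS file apart from the ALTO page files (a file name ending in `_mets.xml`) is an assumption that should be checked against the files you actually download.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_issue(names, mets_suffix="_mets.xml"):
    """Group archive member names by parent folder (one folder per issue).

    Returns {issue_folder: {"mets": [...], "pages": [...]}}.
    ASSUMPTION: the METS wrapper is recognised by `mets_suffix`; this
    naming convention is not stated in the post.
    """
    issues = defaultdict(lambda: {"mets": [], "pages": []})
    for name in names:
        path = PurePosixPath(name)
        # Skip directory entries and anything that is not an XML file.
        if name.endswith("/") or path.suffix.lower() != ".xml":
            continue
        kind = "mets" if path.name.lower().endswith(mets_suffix) else "pages"
        issues[str(path.parent)][kind].append(name)
    # Return plain dicts with sorted keys and sorted file lists.
    return {
        issue: {kind: sorted(files) for kind, files in groups.items()}
        for issue, groups in sorted(issues.items())
    }


def summarise_zip(zip_source):
    """Open a yearly zip (a path or a file-like object) and group its members."""
    with zipfile.ZipFile(zip_source) as archive:
        return group_by_issue(archive.namelist())
```

Calling `summarise_zip` on a downloaded yearly zip should give one entry per issue folder, with the number of page files in each entry giving a quick page count for that issue.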

Delivery

Once the material had been OCR-ed and prepared, Findmypast delivered it to the British Newspaper Archive (with free access to all British Library content in our Reading Rooms), to British Library storage and to our project.

Links and further reading

The open access volume Digitised Newspapers – A New Eldorado for Historians? Reflections on Tools, Methods and Epistemology includes a chapter ‘Hunting for Treasure: Living with Machines and the British Library Newspaper Collection’ that describes this work in more detail.

The intricacies of planning digitisation are discussed in the open access book Collaborative Historical Research in the Age of Big Data: Lessons from an Interdisciplinary Project, which is officially launched today.

We have created a tool called alto2txt that can be used to process the METS/ALTO XML datasets into plain text files.
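For readers curious about what turning ALTO XML into plain text involves, the sketch below shows the basic idea using only the Python standard library. It is not how alto2txt is implemented and does not use its interface; the function names are hypothetical. It relies on the ALTO convention that each recognised word is a `String` element with a `CONTENT` attribute inside a `TextLine`, and it ignores namespaces because the ALTO namespace differs between schema versions. Hyphenated words split across lines are left as they appear.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace, e.g. '{ns}String' -> 'String'."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(xml_data):
    """Return the text of one ALTO page, one output line per TextLine.

    `xml_data` may be bytes (as read from a file) or a str without an
    encoding declaration.
    """
    root = ET.fromstring(xml_data)
    lines = []
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # defensive: skip comments or processing instructions
        if local_name(element.tag) != "TextLine":
            continue
        # Collect the CONTENT attribute of each String child, in order.
        words = [
            child.get("CONTENT", "")
            for child in element
            if isinstance(child.tag, str) and local_name(child.tag) == "String"
        ]
        lines.append(" ".join(word for word in words if word))
    return "\n".join(lines)
```

In practice you would read each page file in binary mode and pass the bytes to `alto_to_text`; for whole collections, alto2txt is the tool intended for this job.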

The newspaper titles are downloadable from our research repository.
