Releasing content for researchers to re-use

Data acquisition | Digitisation | Newspapers | OCR

Written by Claire Austin, June 15, 2023

An aim of the Living with Machines project (and indeed an AHRC ‘gold standard’) is to release as many of the newspapers we digitised as we can (subject to copyright) in forms that other researchers can access and interrogate. The digitisation process undertaken jointly by the British Library and FindMyPast for Living with Machines has resulted in a series of newspaper images and related automatically transcribed (OCR) text, which have been released in various ways.

I’ve blogged before about the complexity of identifying copyright for multiple items such as newspapers. Someone can’t copy an item where copyright is claimed unless they have permission from the owner or a legal exception applies.

The British Library has to be completely sure that it’s not infringing anyone’s copyright, so we only release something into the public domain when rights have expired. Sometimes this means that the British Library has to be careful about releasing items within an existing platform with certain restrictions. However, where copyright permits, we’ve been able to release a usefully large number of newspaper titles by using different methods of dissemination, and I will endeavour to explain them here!

FindMyPast, who manage the British Newspaper Archive (BNA) for the British Library, have a ‘Free to View’ section on their website. A user has to create an account with the BNA in the usual way, but doesn’t have to pay to view images in this section. Free-to-view newspapers can be searched using keywords, and images can be downloaded on an image-by-image basis. Each image will have its own copyright notice that will guide any re-use. If you want to look for something specific, this is a very good place to start, as the BNA has a search engine that helps to identify specific articles. You can also browse titles in this section.

Close-up photograph of a page of a newspaper. The page curves close to the binding of the volume of papers.

The automatically transcribed (OCR) text that corresponds to the images on the BNA is also downloadable from the News datasets section of the British Library’s repository. The repository includes OCR text from other newspapers in addition to Living with Machines datasets. These files can be downloaded free of charge for use with machine learning, or any other method, come to that. Some of the newspaper titles are curated sample datasets and some have been released in their entirety. The newspaper OCR text that is on the repository can be repurposed and shared without restriction.

A screenshot of ALTO XML text, showing information such as Predicted Word Accuracy and information about margins.
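If you would like to get started with the OCR files, the sketch below shows one possible way to pull the words out of an ALTO XML page using only Python's standard library. It is a minimal illustration rather than an official Living with Machines tool: the function names and the sample page are made up for this example, the namespace in your downloaded files may differ (so the code ignores namespaces), and it assumes that each word is a String element with a CONTENT attribute and an optional WC (word confidence) attribute, as in the ALTO standard.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any XML namespace, e.g. '{ns}String' -> 'String'."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one list of (word, confidence) pairs per TextLine element.

    Hypothetical helper for illustration; confidence is None when a
    String element carries no WC attribute.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = []
        for child in element:
            if local_name(child.tag) == "String":
                wc = child.get("WC")
                confidence = float(wc) if wc is not None else None
                words.append((child.get("CONTENT", ""), confidence))
        lines.append(words)
    return lines


# Invented sample page, not taken from the repository.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Living" WC="0.95"/><SP/>
      <String CONTENT="with" WC="0.90"/><SP/>
      <String CONTENT="Machines" WC="0.60"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

lines = alto_lines(SAMPLE)
text = "\n".join(" ".join(word for word, _ in line) for line in lines)
print(text)
```

The per-word confidence values could be used to filter out poorly recognised words before analysis. Note that this sketch does not rejoin words hyphenated across line breaks.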

Lastly, you may want to access the digitised images from which the OCR text was transcribed. Once the images have been processed, they can be requested from BL Labs using their standard protocols and decision-making process. You can register your interest now by contacting BL Labs. None of these methods requires you to pay for access to either images or OCR text.

We hope you have fun and success in using this content. Please let us know how you get on!

Latest posts from us

Outreach and marketing for crowdsourcing tasks

June 27, 2024

Imagine you’ve set up a shiny new crowdsourcing project. How do you let people who might potentially want to...

Public domain newspaper titles in Living with Machines

May 7, 2024

A list of public domain newspaper titles available within the Living with Machines project; downloadable for re-use by...

New ‘language of mechanisation’ publication and datasets released

May 2, 2024

We’re delighted to share the news that our data paper has been published by the Journal of Open Humanities Data....
