Releasing content for researchers to re-use

Written by Claire Austin, June 15, 2023

A desire of the Living with Machines project (and indeed an AHRC ‘gold standard’) is to release as many of the newspapers we digitised as we can (subject to copyright) in forms that other researchers can access and interrogate. The digitisation process undertaken jointly by the British Library and FindMyPast for Living with Machines has resulted in a series of newspaper images and related automatically transcribed (OCR) text, which have been released in various ways.  

I’ve blogged before about the complexity of identifying copyright for multiple items such as newspapers. Someone can’t copy an item where copyright is claimed unless they have permission from the owner or a legal exception applies.

The British Library has to be completely sure that it’s not infringing anyone’s copyright, so we only release something into the public domain when rights have expired. Sometimes this means that the British Library has to be careful about releasing items within an existing platform with certain restrictions. However, where copyright permits, we’ve been able to release a usefully large number of newspaper titles by using different methods of dissemination, and I will endeavour to explain them here!

FindMyPast, who manage the British Newspaper Archive (BNA) for the British Library, have a ‘Free to View’ section on their website. A user has to create an account with the BNA in the usual way, but doesn’t have to pay to view images in this section. Free to View newspapers can be searched using keywords, and images can be downloaded on an image-by-image basis. Each image will have its own copyright notice that will guide any re-use. If you want to look for something specific, this is a very good place to start, as the BNA has a search engine that helps to identify specific articles. You can also browse titles in this section.

Close-up photograph of a page of a newspaper. The page curves close to the binding of the volume of papers.

The automatically transcribed (OCR) text that corresponds to the images on the BNA is also downloadable from the News datasets section of the British Library’s repository. The repository includes OCR text from other newspapers in addition to the Living with Machines datasets. These files can be downloaded free of charge for use with machine learning, or with any other method, come to that. Some of the newspaper titles are available as curated sample datasets and some have been released in their entirety. The newspaper OCR text that is on the repository can be repurposed and shared without restriction.

A screenshot of ALTO XML text, showing information such as Predicted Word Accuracy and information about margins.
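
If you download the OCR text as ALTO XML, the words can be pulled out with a few lines of Python. What follows is a minimal sketch rather than an official recipe: the sample document, the function names `local_name` and `alto_text_lines`, and the namespace shown are illustrative assumptions, and the real files on the repository may be structured or packaged differently. It relies only on the standard ALTO convention that each word is a `String` element with a `CONTENT` attribute inside a `TextLine`.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_text_lines(xml_text):
    """Return the OCR text of an ALTO document as a list of lines."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Each word is a String element; its text is in the CONTENT attribute.
        words = [
            child.attrib["CONTENT"]
            for child in elem
            if local_name(child.tag) == "String" and "CONTENT" in child.attrib
        ]
        if words:
            lines.append(" ".join(words))
    return lines


# Hypothetical, heavily simplified ALTO fragment for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace><TextBlock ID="B1">
    <TextLine>
      <String CONTENT="Living" WC="0.95"/><SP/><String CONTENT="with" WC="0.90"/>
    </TextLine>
    <TextLine><String CONTENT="Machines" WC="0.80"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    for line in alto_text_lines(SAMPLE):
        print(line)
```

Matching on local tag names, rather than a fixed namespace, is a deliberate choice here, because different ALTO versions declare different namespaces.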

Lastly, you may want to access the digitised images from which the OCR text was transcribed. Once the images have been processed, they can be requested from BL Labs using their standard protocols and decision-making process. You can register your interest now by contacting BL Labs. None of these methods requires you to pay for access to either images or OCR text.

We hope you have fun and success in using this content. Please let us know how you get on!
