Over half a million pages of historical newspapers now openly available
The datasets of out-of-copyright volumes up to 1902 are in a specialist format called METS/ALTO XML. They are supplemented by additional ‘free to view’ records in the British Newspaper Archive.
We look forward to seeing what you do with the data! Please get in touch at firstname.lastname@example.org if you have questions or want to share how you used the newspapers.
This blog post tells you how we created this newspaper collection and where you can find it.
From the project’s start, our goals included digitising historical newspapers held at the British Library to support research into the long nineteenth century (c. 1780s–1920), prioritising industrial areas that were under-represented in the British Newspaper Archive. After years of imaging, post-capture processing, ingest and rights clearance, our work is complete.
Selecting titles for digitisation
The British Library collection contains over 60 million individual newspaper issues. With such a vast collection and such open research questions, selecting the right papers for digitisation was not an easy task.
We’ve described some of the practical and intellectual challenges of selecting newspaper titles in our post about ‘Press Picker: visualising formats and title name changes in the British Library’s newspaper holdings’.
To help expedite the process, we prioritised longer, more consistent runs and titles with more microfilm available over shorter runs and titles held only in physical form.
Scanning speed, quality and cost depend on the source format. Digitising from microfilm is quicker and cheaper, but the final image quality is lower, both because the acetate may have deteriorated and because the original microfilm image is itself of lower quality. Physical copies, on the other hand, are more complex and delicate to handle. In many cases the volumes holding the papers needed extra conservation assessment, or tight bindings loosened – work that can add delays and increase costs. Prioritising microfilms, although less preferable from a curatorial point of view, allowed us to get the results to our researchers sooner.
Luckily the microfilms were all in good condition, except for some reels which were flagged in the Press Picker and discarded at the time of selection.
After copyright assessment, our list was sent to the British Library’s Imaging Service Team, who retrieved the microfilms and bound physical volumes and proceeded with scanning.
During scanning, each newspaper run was divided into years – the basic unit of organisation – and issues. The frequency of issues, the number of pages and the length of the run vary by newspaper title, which makes it difficult to assess the number of pages in each title in advance. This was a challenge for budgeting and planning timelines.
Sources that help calculate the number of pages in a title are the Waterloo Directories and original 19th century Press Directories. The Mitchell’s Press Directories (download link) are particularly important as they also provide further information like political leaning, price and circulation of the newspaper, as well as the names of owners.
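As a rough illustration of the arithmetic involved, the sketch below estimates a title's page count from the kind of information a press directory might give. The function name, the segment structure and every figure in it are hypothetical and are not taken from our planning data; real runs also have gaps, supplements and irregular issues that a simple multiplication cannot capture.

```python
def estimate_pages(segments):
    """Estimate total pages for a run described as a list of segments.

    Each segment is a tuple (issues_per_year, pages_per_issue, years),
    covering a period in which frequency and pagination stayed the same.
    """
    return sum(issues * pages * years for issues, pages, years in segments)


# Hypothetical title: weekly with 4 pages for 10 years,
# then six issues a week (about 312 a year) with 8 pages for 5 years.
hypothetical_run = [(52, 4, 10), (312, 8, 5)]
print(estimate_pages(hypothetical_run))  # 2080 + 12480 = 14560
```

Splitting a run into segments matters because a change from weekly to daily publication can multiply the page count several times over.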
Once scanned and quality checked, each newspaper run was delivered to Findmypast for further processing, including optical character and layout recognition (OCR and OLR) and packaging. Each year of publication of a specific newspaper is grouped into a zip file with a folder per issue and, within it, an ALTO/XML file per page and a METS/XML file as a metadata wrapper.
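If you want to check what a downloaded zip contains before processing it, something like the following minimal sketch may help. It assumes only the layout described above (a folder per issue holding XML files) and makes no assumption about file names: it tells ALTO page files from METS wrappers by looking at the root element of each XML file, which is `alto` or `mets` in the respective standards. The function names are our own for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import PurePosixPath


def root_local_name(xml_bytes):
    """Return the root element's tag without its namespace, lower-cased."""
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start",)):
        # The first "start" event is always the root element.
        return elem.tag.rsplit("}", 1)[-1].lower()
    return None


def summarise_year_zip(zip_source):
    """Map each issue folder to counts of ALTO page files and METS wrappers.

    zip_source may be a path or a file-like object.
    """
    summary = defaultdict(lambda: {"alto": 0, "mets": 0})
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip directory entries and non-XML files
            issue_folder = str(PurePosixPath(name).parent)
            kind = root_local_name(archive.read(name))
            if kind in ("alto", "mets"):
                summary[issue_folder][kind] += 1
    return dict(summary)
```

Calling `summarise_year_zip("some_year.zip")` (a placeholder file name) should return one entry per issue folder, where the `alto` count is the number of pages found and the `mets` count is normally 1.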
Once the material was OCR-ed and prepped it was delivered by Findmypast to the British Newspaper Archive (with free access to all British Library content in our Reading Rooms), to British Library storage and to our project.
Links and further reading
The open access volume Digitised Newspapers – A New Eldorado for Historians? Reflections on Tools, Methods and Epistemology includes a chapter ‘Hunting for Treasure: Living with Machines and the British Library Newspaper Collection’ that describes this work in more detail.
The intricacies of planning digitisation are discussed in the open access book Collaborative Historical Research in the Age of Big Data: Lessons from an Interdisciplinary Project that is officially launched today.
We have created a tool called alto2txt that can be used to process the METS/ALTO XML datasets into plain text files.
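To give a sense of what converting ALTO to plain text involves, here is a minimal sketch that is not alto2txt and does not reproduce how it works. It relies only on the ALTO convention that recognised words are stored in the `CONTENT` attribute of `String` elements inside `TextLine` elements. It ignores hyphenation, reading order across articles and the METS metadata, all of which a full tool needs to handle.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, so any ALTO schema version matches."""
    return tag.rsplit("}", 1)[-1]


def alto_page_to_text(alto_xml):
    """Return the page text, one output line per ALTO TextLine.

    alto_xml is the content of a single ALTO page file (str or bytes).
    """
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each String child holds one recognised word in CONTENT.
        words = [
            child.attrib["CONTENT"]
            for child in element
            if local_name(child.tag) == "String" and "CONTENT" in child.attrib
        ]
        if words:
            lines.append(" ".join(words))
    return "\n".join(lines)
```

When reading from disk, passing the file's bytes (for example `Path(page_file).read_bytes()`, where `page_file` is a placeholder) lets the parser honour whatever encoding the XML declares.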