Publications

Books

Articles, book chapters and conference proceedings

Datasets

Digitised collections

Press Directories

Press Directories list the newspapers published in each year, with locations, political leanings, cost, distribution and other information. Find out more about the Victorian newspaper landscape, view automatically transcribed text on the British Library open repository, or explore the newly released, enriched and structured version of Mitchell’s Newspaper Press Directory (1846-1920).

Historical Newspapers

We digitised over half a million pages of historical newspapers and made them ‘free to view’ on the British Newspaper Archive website. They span a period from c. 1780 to c. 1920 and cover most areas of England (excluding London).

Georeferenced OS Maps

Around 15,000 historical Ordnance Survey (OS) maps have been digitised and fully georeferenced. We’re working on making these publicly available.

Pandaemonium: the coming of the machine as seen by contemporary observers, 1660-1886

From the personal papers of Humphrey Jennings, we digitised the original manuscript material for the posthumously published book (André Deutsch, 1985), a compendium of primary sources and testimonies about industrialisation in Britain. The archive contains three times more material than was previously published. We are working to make these texts publicly available.

Derived datasets (new information or datasets that are created from existing data)

StopsGB (Structured Timeline of Passenger Stations in Great Britain)

Taking Michael Quick’s book Railway Passenger Stations in Great Britain: a Chronology as a starting source, we transformed its listing of over 12,000 stations into a structured data format. Each station is given attributes such as operating companies and opening and closing dates. Where possible, stations are georeferenced and linked to Wikidata. This structured, linked and georeferenced dataset could be a key resource for the historical, digital library and semantic web communities, and for others researching the impact of the railway in Great Britain.
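
As a rough illustration of the kind of structured record involved (the field names and values below are placeholders for the example, not the published schema), a single station entry might look like this:

```python
# Hypothetical StopsGB-style record; all field names and values are illustrative only.
station = {
    "name": "Example Junction",
    "operating_companies": ["Example Railway Company"],
    "opened": 1846,
    "closed": 1952,
    "latitude": 53.48,            # georeferenced where possible
    "longitude": -2.24,
    "wikidata_id": "Q000000",     # linked to Wikidata where a match was found
}

# Example query: stations with known opening dates before 1850 that have coordinates.
def early_georeferenced(stations):
    return [s for s in stations
            if s["opened"] and s["opened"] < 1850
            and s["latitude"] is not None and s["longitude"] is not None]
```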

You can find the related code here.

Dataset for Toponym Resolution in Nineteenth-Century English Newspapers

455 annotated articles from newspapers based in four locations in England (Manchester, Ashton-under-Lyne, Poole and Dorchester), published between 1780 and 1870. Place names within the articles were manually annotated and, where possible, linked to Wikipedia. The dataset was produced to aid toponym resolution in English-language digitised historical newspapers, and is of particular interest to researchers working on improving semantic access to historical newspaper content.
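
By way of illustration only (the released annotation format may differ), a single annotated place-name mention can be thought of as a record like this:

```python
# Hypothetical representation of one annotated toponym mention;
# field names are illustrative, not the dataset's actual schema.
mention = {
    "article_id": "example-0001",
    "publication_place": "Manchester",      # one of the four newspaper locations
    "year": 1830,
    "surface_form": "Ashton",               # the place name as printed (possibly with OCR errors)
    "wikipedia_link": "Ashton-under-Lyne",  # linked where possible, otherwise None
}
```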

Living Machines Atypical Animacy dataset

The Atypical Animacy detection dataset is based on nineteenth-century English sentences extracted from an open dataset of nineteenth-century books digitised by the British Library (available via https://doi.org/10.21250/db14, British Library Labs, 2014). It contains 598 sentences mentioning machines, each annotated according to the animacy and humanness of the machine in the sentence. The dataset was created as part of the following paper: Coll Ardanuy, M., F. Nanni, K. Beelen, K. Hosseini, R. Ahnert, J. Lawrence, K. McDonough, G. Tolfo, D.C.S. Wilson and B. McGillivray. “Living Machines: A study of atypical animacy.” In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020).
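
A minimal sketch of how such an annotated file might be loaded and filtered with pandas, assuming a tab-separated file and illustrative column names (‘sentence’, ‘animacy’) that are not necessarily those of the released dataset:

```python
import pandas as pd

# Assumed file name and column names, for illustration only.
df = pd.read_csv("atypical_animacy.tsv", sep="\t")

# Sentences in which the machine mention was annotated as animate (assumed 1 = animate).
animate_machines = df[df["animacy"] == 1]
print(len(df), "sentences in total,", len(animate_machines), "annotated as animate")
```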

You can find the related code here.

Diachronic and diatopic word embeddings for digitised newspapers

Decade-level Word2vec models trained on automatically transcribed text from 19th-century newspapers digitised by the British Library (1800-1919).

Word embeddings trained with Word2Vec on a 4.2-billion-word corpus of 19th-century British newspapers. Twelve models are available, one for each decade between 1800 and 1919. The models can be used to analyse the language of each decade on its own, or as diachronic word embeddings, since they have been aligned using Orthogonal Procrustes. Open-source tools to explore them as diachronic embeddings are available here.
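
Orthogonal Procrustes alignment finds the rotation that best maps one decade’s embedding space onto another’s, so that word vectors can be compared across decades. A minimal sketch with numpy/scipy, assuming two matrices of vectors for a shared vocabulary (the random data here simply stands in for two decade models):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

# X and Y: embeddings for the same (shared) vocabulary in two decades,
# shape (n_words, n_dimensions). Illustrative random data stands in here.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))   # e.g. 1800s model
Y = rng.normal(size=(1000, 300))   # e.g. 1810s model

# R is the orthogonal matrix that best maps X onto Y in the least-squares sense.
R, _ = orthogonal_procrustes(X, Y)
X_aligned = X @ R

# After alignment, a word's vectors from different decades can be compared directly,
# e.g. with cosine similarity, to study semantic change.
```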

Living with Machines alpha and beta Zooniverse ‘accident’ task data

Annotations, gathered via crowdsourcing tasks on the Zooniverse platform, on 19th-century newspaper articles that potentially mention accidents involving machinery. Personal, organisational and place names mentioned in the articles were transcribed, along with brief summaries of the relevant accidents.

Neural Language Models for Historical Research

Four types of pre-trained neural language models trained on a large historical dataset of books in English, published between 1760 and 1900 and comprising ~5.1 billion tokens. The language model architectures include word type embeddings (word2vec and fastText) and contextualised models (BERT and Flair). For each architecture, we trained a model instance using the whole dataset. Additionally, we trained separate instances on text published before 1850 for the type embeddings (i.e. word2vec and fastText), and four instances considering different time slices for BERT.
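
A minimal sketch of how one of the contextualised models might be queried with the Hugging Face transformers library; the model path below is a placeholder for a downloaded model directory or hub identifier, not the actual name of the released models:

```python
from transformers import pipeline

# "path/to/historical-bert" is a placeholder, not the released models' identifier.
fill_mask = pipeline("fill-mask", model="path/to/historical-bert")

# Query the model with a masked historical sentence.
for prediction in fill_mask("The [MASK] was powered by steam."):
    print(prediction["token_str"], round(prediction["score"], 3))
```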

You can find the related code here.

Geocoded census data (work-in-progress)

I-CeM census data enhanced with links to OS Open Roads and GB1900.

Geocoded census data enhanced with StopsGB distance and MapReader metrics (work-in-progress)

Geocoded census data enhanced with the distance to the nearest StopsGB station and a ‘railspace’ score derived from MapReader-identified patches.
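
As an illustration of the kind of enrichment involved (not the project’s actual pipeline), the distance from a geocoded census address to its nearest station can be computed with a simple haversine search over station coordinates:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_station_km(address, stations):
    """address: (lat, lon); stations: iterable of (name, lat, lon). Placeholder data shapes."""
    return min((haversine_km(address[0], address[1], lat, lon), name)
               for name, lat, lon in stations)

# Illustrative placeholder coordinates.
print(nearest_station_km((53.48, -2.24), [("Example Station", 53.47, -2.23)]))
```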

Data visualisations and visualisation tools

Macromap

‘Macromap’ is an interactive ‘small multiples’ visualisation for historical map collections. It is designed to help researchers understand what map sheets the British Ordnance Survey (OS) made, when and where. The interface could alternatively be used with other historical maps and map series metadata, and, more generally, to understand the geographic and temporal shape of large-scale polygon datasets. Find out more at our blog post Macromap: Interactive Maps in Time and Observable Macromap Notebook.

Press Tracer

A data visualisation to help trace the lineage of historical newspaper titles in the British Library. Find out more at Press Tracer: Visualise Newspaper Lineage and dig into the code behind Press Tracer.

Branching sparklines line graphs

A notebook that demonstrates the branching design used in Press Picker: an interactive visualisation tool for newspaper metadata at the British Library.

Software packages, research tools and code supporting published papers

Software packages

Alto2Text

A plain text and metadata extraction tool that processes XML in METS/ALTO format into plain text and metadata fields. It will be available soon in XSLT and Python versions.
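
A minimal sketch of the underlying idea, extracting word content from an ALTO file with Python’s standard library (this is not the tool’s own code):

```python
import xml.etree.ElementTree as ET

def alto_to_text(path):
    """Concatenate the CONTENT attributes of ALTO <String> elements into plain text."""
    root = ET.parse(path).getroot()
    words = [elem.attrib["CONTENT"]
             for elem in root.iter()
             if elem.tag.endswith("}String") or elem.tag == "String"]
    return " ".join(words)

# Usage (the path is a placeholder):
# print(alto_to_text("page_0001.xml"))
```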

DeezyMatch – A Flexible Deep Neural Network Approach to Fuzzy String Matching

DeezyMatch, a new deep learning approach to fuzzy string matching and candidate ranking, is free, open-source software. It addresses advanced string matching and candidate ranking challenges in a comprehensive and integrated manner.
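
To illustrate the problem DeezyMatch addresses (this is a naive edit-distance baseline for the concept, not DeezyMatch’s neural approach or its API), candidate ranking for a query string can be sketched as:

```python
from difflib import SequenceMatcher

def rank_candidates(query, candidates, top_n=3):
    """Rank candidate strings by a simple similarity ratio to the query."""
    scored = [(SequenceMatcher(None, query.lower(), c.lower()).ratio(), c)
              for c in candidates]
    return sorted(scored, reverse=True)[:top_n]

# Toy gazetteer of candidate place names (illustrative only).
print(rank_candidates("Ashton under Lyne",
                      ["Ashton-under-Lyne", "Aston", "Ashington", "Lyneham"]))
```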

Zoonyper

Zoonyper is a work-in-progress Python library that facilitates interpretation and wrangling for Zooniverse files in Jupyter and Python more generally.

MapReader

MapReader is a free open-source software library written in Python for analysing large map collections. MapReader allows users with little or no computer vision expertise to i) retrieve maps via web-servers; ii) preprocess and divide them into patches; iii) annotate patches; iv) train, fine-tune, and evaluate deep neural network models; and v) create structured data about map content.
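
As a rough illustration of the ‘patch’ idea (not MapReader’s own API), a digitised map sheet can be divided into fixed-size square patches with Pillow:

```python
from PIL import Image

def patchify(image_path, patch_size=100):
    """Yield (x, y, patch) tuples covering the image in square patches."""
    image = Image.open(image_path)
    width, height = image.size
    for y in range(0, height - patch_size + 1, patch_size):
        for x in range(0, width - patch_size + 1, patch_size):
            yield x, y, image.crop((x, y, x + patch_size, y + patch_size))

# Usage (the path is a placeholder); each patch can then be annotated and classified.
# for x, y, patch in patchify("map_sheet.png"):
#     patch.save(f"patches/patch_{x}_{y}.png")
```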

PressPicker

PressPicker is a software package created to help select the newspapers to digitise in Living with Machines. This Jupyter Notebook-based tool allowed us to weigh research-led and practical factors, e.g. selecting titles by format, viewing the holdings of a specific title at a glance, and predicting retrieval and scanning timeframes and costs. Find out more about how PressPicker was created and how it works.

nnanno

Newspapers are a visual medium, communicating through text and through visual information such as photographs, comics, maps and other images. Research on images within newspapers is advancing as computer vision, powered by deep learning, develops. A recent project, Newspaper Navigator, from Benjamin Lee and the Library of Congress Labs extracted visual content from Chronicling America.

To make it easier to work with this dataset, we created a tool called nnanno. nnanno helps with sampling from the Newspaper Navigator dataset, downloading images, annotating them and experimentally applying computer vision models to the Newspaper Navigator data.

Flyswot

flyswot is a command-line tool that lets you run Hugging Face Transformers image classification models, available via the Hugging Face Hub 🤗, against a directory of images. It returns a CSV report containing the model’s predictions. It is being used by the Heritage Made Digital team at the British Library to run computer vision models that predict whether an image of a manuscript page contains a ‘flysheet’ or not.
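
The equivalent workflow can be sketched directly with the transformers pipeline API (the model identifier and directory below are placeholders, and this is not flyswot’s own code):

```python
import csv
from pathlib import Path
from transformers import pipeline

# Placeholder model identifier; flyswot itself fetches its models from the Hugging Face Hub.
classifier = pipeline("image-classification", model="org/flysheet-model")

with open("report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "label", "score"])
    for image_path in Path("images/").glob("*.jpg"):
        top = classifier(str(image_path))[0]   # highest-scoring prediction
        writer.writerow([image_path.name, top["label"], round(top["score"], 4)])
```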

Hugit

hugit is a command-line tool for loading ImageFolder-style datasets into a Hugging Face 🤗 dataset and pushing it to the Hugging Face 🤗 Hub.

The primary goal of hugit is to help quickly get a local dataset into a format that can be used for training computer vision models. hugit was developed to support the workflow for flyswot, where we wanted quicker iteration between creating new training data, training a model, and using the new model inside flyswot.
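
The underlying Hugging Face datasets calls look roughly like this (the directory layout and repository name are placeholders):

```python
from datasets import load_dataset

# An ImageFolder-style layout: one sub-directory per label, e.g. data/flysheet/, data/other/.
dataset = load_dataset("imagefolder", data_dir="data/")

# Push the dataset to the Hugging Face Hub (requires being logged in; the name is a placeholder).
dataset.push_to_hub("my-org/my-image-dataset")
```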

Census Geo-coder (work-in-progress)

A Python package that links historical Great Britain census data to existing GIS datasets of streets using geo-blocking and fuzzy string matching. We currently use OS Open Roads and GB1900, but it will accept any GIS dataset of roads.
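
A minimal sketch of geo-blocking followed by fuzzy string matching (illustrative only, not the package’s implementation): candidate streets are first restricted to the same parish, then ranked by string similarity.

```python
from difflib import get_close_matches

def match_street(census_parish, census_street, gis_streets):
    """gis_streets: mapping parish -> list of street names (placeholder structure).
    Blocking: only compare against streets in the same parish."""
    block = gis_streets.get(census_parish, [])
    matches = get_close_matches(census_street, block, n=1, cutoff=0.6)
    return matches[0] if matches else None

# Toy example with placeholder data.
gis = {"Example Parish": ["High Street", "Mill Lane", "Station Road"]}
print(match_street("Example Parish", "High St.", gis))
```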

Research Tools

Neural Language Models for Historical Research

Four types of pre-trained neural language models trained on a large historical dataset of books in English, published between 1760 and 1900 and comprising ~5.1 billion tokens. The language model architectures include word type embeddings (word2vec and fastText) and contextualised models (BERT and Flair). For each architecture, we trained a model instance using the whole dataset. Additionally, we trained separate instances on text published before 1850 for the type embeddings (i.e. word2vec and fastText), and four instances considering different time slices for BERT. This repository describes the models and includes code showing how to use them.

Living with Machines GitHub Stats report

GitHub provides statistics for repositories, including views and clones traffic, but by default this information is only shown for two weeks. This repository uses GitHub Actions and gh_orgstats to grab the data every week and update a CSV file for public repositories under the Living with Machines GitHub organisation; this is documented in more detail in the repository. It also uses Jupyter notebooks and nbconvert to automatically update a report based on these GitHub stats.

Gh_orgstats

gh_orgstats is a small Python wrapper for retrieving GitHub stats for a particular organisation.

Zooniverse images uploader

Images from the digitised newspaper articles were selected and uploaded to Zooniverse for annotation. Defoe, a Spark-based toolbox for analysing digital historical textual data, was used to select the images for annotation on Zooniverse. Defoe can also be used for tasks such as sentence/document classification.

You can find the Defoe code here.

Word2vec model explorers and lexicon expansion

Notebook for exploring word2vec models in order to build a lexicon that can trace certain topics in a collection. The Lexicon Expansion Interface allows users to navigate a vector space and expand a list of seed words into a Lexicon.
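
The core of lexicon expansion can be sketched with gensim (the model path and seed words below are placeholders):

```python
from gensim.models import KeyedVectors

# Placeholder path to one of the decade-level word2vec models.
wv = KeyedVectors.load("1860s_word2vec.kv")

seed_words = ["machine", "engine", "factory"]   # illustrative seed words

# Candidate lexicon entries near the seeds in the vector space,
# to be accepted or rejected by the researcher.
for word, score in wv.most_similar(positive=seed_words, topn=20):
    print(word, round(score, 3))
```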

Pipeline for preprocessing, training, and aligning diachronic word embeddings from Big Historical Data

A Python pipeline for training and exploring diachronic word embeddings (Word2Vec) from very large historical datasets for which metadata on the year of publication of each text file is available. Given a series of text files, each containing all the texts for one particular time unit, the pipeline allows you to train diachronic embedding models and carry out semantic change analysis on them, including semantic change detection, visualisation of meaning-change trajectories and clustering of semantic change types.
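
A minimal sketch of the training step, assuming one plain-text file per decade and using gensim (the file names and parameters are illustrative, not the pipeline’s own configuration):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One tokenisable text file per time unit, e.g. decade_1800.txt, decade_1810.txt, ...
for decade in range(1800, 1920, 10):
    sentences = LineSentence(f"decade_{decade}.txt")   # one sentence per line
    model = Word2Vec(sentences=sentences, vector_size=300,
                     window=5, min_count=10, workers=4)
    model.save(f"word2vec_{decade}.model")
```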

Pipeline for processing the Newspaper Press Directories

This series of notebooks includes a pipeline for processing the OCR output derived from the scans of Mitchell’s Press Directories. The stages include annotation, preprocessing, automatic tagging and database ingest.

Bl-books-genre detection model

This fine-tuned distilbert-base-cased model is trained to predict whether a book from the British Library’s Digitised printed books (18th-19th century) collection is fiction or non-fiction, based on the title of the book.
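
A minimal usage sketch with the transformers pipeline API (the model identifier below is a placeholder, not necessarily the published model’s name on the Hub):

```python
from transformers import pipeline

# Placeholder identifier for the fine-tuned distilbert-base-cased genre classifier.
genre_classifier = pipeline("text-classification", model="org/bl-books-genre")

titles = ["The History of the Cotton Manufacture in Great Britain",
          "A Tale of Two Cities"]
for title, prediction in zip(titles, genre_classifier(titles)):
    print(title, "->", prediction["label"], round(prediction["score"], 3))
```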

Flyswot computer vision model

A fine-tuned computer vision model (convnext-tiny-224) that has been trained to classify different types of digitised manuscript pages in order to detect digitised manuscripts which have incorrect metadata associated with them.

JISC Wrangler

The jisc-wrangler is a Python tool written specifically to restructure and deduplicate XML files containing OCR content from the JISC 1 & JISC 2 newspaper datasets. It outputs a canonical file structure and filename convention amenable to further processing with the alto2txt tool. By cleaning, deduplicating and standardising the directory structure and filenames, it makes the JISC 1 & JISC 2 newspaper datasets accessible to the research project, performing an essential pre-processing step that unlocks the potential of this open-access dataset.

Code supporting published papers

Code for Targeted Sense Disambiguation

Underlying code and materials for the paper ‘When Time Makes Sense: A Historically-Aware Approach to Targeted Sense Disambiguation’. Time-sensitive Targeted Sense Disambiguation (TSD) aims to detect instances of a sense, or set of related senses, in historical and time-stamped texts. The paper aims to 1) scrutinise the effect of applying historical language models on the performance of several TSD methods and 2) assess different disambiguation methods that take into account the year in which a text was produced.

Code for Atypical Animacy

Underlying code and materials for the paper ‘Living Machines: A Study of Atypical Animacy’ (COLING2020).

Code for Station to Station: Linking and Enriching Historical British Railway Data & StopsGB

Underlying code and materials for the paper ‘Station to Station: Linking and Enriching Historical British Railway Data’.

Code and supplementary material for the paper ‘A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching’

Underlying code and materials for the paper ‘A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching’, accepted to SIGSPATIAL 2020 as a poster paper. This work finds potential locations for each toponym (place name) identified in text. It addresses the high degree of variation in toponyms (due to regional spelling differences, transliteration strategies, and cross-language and diachronic variation), as well as variation introduced by OCR errors.

Supplementary material: https://zenodo.org/record/4034818

Code for the paper ‘Assessing the Impact of OCR Quality on Downstream NLP Tasks’

Underlying code for the paper ‘Assessing the Impact of OCR Quality on Downstream NLP Tasks’. The code runs the experiments reported in the paper and generates the figures used in it.

Code for the paper ‘Resolving Places, Past and Present: Toponym Resolution in Historical British Newspapers Using Multiple Resources’

Resolving Places is one of the first outputs of Living with Machines, a collaborative digital history project at The Alan Turing Institute and the British Library. This research is part of our work to build a nineteenth-century gazetteer that combines place names derived from historical sources (GB1900) with online resources (Wikipedia and GeoNames). The Living with Machines gazetteer follows best practices in combining multiple existing resources, and is novel in accounting for places of different scales (e.g. streets, buildings, cities, counties).

Prizes and honours

Inter Circle U. Prize

Barbara McGillivray was awarded the Inter Circle U. Prize for inter- and transdisciplinary research for “The Language of Mechanisation” project, together with Jon Lawrence, Mia Ridge, Kalle Westerling, Giorgia Tolfo and Nilo Pedrazzini. The prize is co-funded by the European Union’s Horizon 2020 Research and Innovation Programme.

Public engagement

Crowdsourcing – digital volunteering to create research data

Crowdsourcing around digitised collections was built into our public engagement plans from the start of the project. Our tasks have been designed to expose some of the processes of data science and digital history to participants, while also collecting data to the quality required for the computational linguistic processes they support.

Our projects are built on Zooniverse and available at https://www.zooniverse.org/projects/bldigital/living-with-machines/.

Workshops and tutorials


Zooniverse: how to download and analyse your task annotations

This workshop was created for British Library staff to introduce the widely used Zooniverse platform and the services it offers, and to share new developments in using the Library’s IIIF items on Zooniverse. It also taught participants how to process annotations into clean, readable spreadsheets for use in personal and library projects.

Genre Classification

This Jupyter book documents work to develop a machine learning model and associated datasets for classifying the genre of books from the British Library. Discover the background for this project here.

Intro to AI for GLAM

This Carpentries-style lesson was developed to empower GLAM (Galleries, Libraries, Archives, and Museums) staff by providing the foundation to support, participate in, and begin to undertake in their own right machine learning-based research and projects with heritage collections.

How to use Jupyter notebooks

A workshop given as part of the Digital Scholarship “hack and yack” cycle. It explains what Jupyter notebooks are and why they are used, shows how to run notebooks created by other people, and introduces some of the weird and wonderful things that can be done with notebooks.

Working with maps at scale using Computer Vision and Jupyter notebooks

A workshop delivered as part of the Digital Humanities and Digital Archives workshop at the National Library of Estonia. It shows how Jupyter notebooks can be particularly useful for working with digitised collections at scale, gives a brief sense of what is possible using computer vision with image collections, and offers some ideas for how existing GLAM infrastructure (in this case IIIF) can support new machine learning-based approaches.

Computer-Vision-for-the-Humanities-workshop

This workshop aims to provide an introduction to computer vision for humanities uses. In particular, it gives a high-level overview of machine learning-based approaches to computer vision, with a focus on supervised learning, and includes discussion of working with historical data. The materials are based on a two-part Programming Historian lesson.

Programming Historian: Computer vision for the humanities: an introduction to deep learning for image classification

A two-part Programming Historian tutorial which aims to introduce humanities researchers, or those working with humanities data, to deep learning-based computer vision methods. Work is in progress on part one and part two.

image-search: Materials for a workshop on image search for heritage data

Materials for a workshop on image search with a focus on heritage data. The workshop is based on the blog post ‘Image search with HuggingFace 🤗 datasets’ but goes into a little more detail.

Podcast episodes


Our Funder and Partners