Can we use machine learning to classify whether a book is ‘fiction’ or ‘non-fiction’ from its title?
Summary: we built a book genre classification model using training data created by British Library staff. The model works quite well, but not yet well enough to use for creating catalogue data. You can see the full process in a tutorial.
Metadata is a vital tool for helping people use library collections. When you search for a book, good quality metadata can help you to filter by date of publication, language, or the book’s length. Metadata can also help disambiguate: for example, if you search for “python”, good metadata should allow you to sort books about the programming language from books about snakes.
However, metadata doesn’t emerge out of thin air: most of the time, a human is involved in creating it, and there just aren’t enough humans with enough time to create all the metadata we’d like to have. The challenge of capturing extensive metadata for every item in a collection also grows with the rise of digitisation and born-digital materials. As a result, many libraries and other GLAMs (Galleries, Libraries, Archives and Museums) are increasingly interested in using machine learning methods to help create or augment metadata for collections. For example, the AI4LAM community is “focused on advancing the use of artificial intelligence in, for and by libraries, archives and museums”. Machine learning methods use training data, consisting of example inputs paired with desired output labels, to ‘teach’ a model to perform a task on new, unseen data.
One collection which could benefit from additional metadata is the digitised books created by the British Library in partnership with Microsoft. This project digitised a selection of out-of-copyright 18th- and 19th-century texts, and the collection was subsequently made openly available in several formats across many platforms. However, the metadata for the collection was incomplete, which created challenges for working with the material. In order to revise and enrich the existing metadata, Victoria Morris (a member of the BL’s Metadata Standards team) set up an internal crowdsourcing project on the Zooniverse platform with the aim of enhancing records for Microsoft Digitised Books by providing information about:
- Country and place of publication
- Date of publication
- Language(s) of content
- Literary form (fiction or non-fiction)
- Genre (for fiction items)
This crowdsourcing project was a helpful step towards increasing the metadata coverage of this collection. However, the effort could only provide updated metadata for a subset of the entire collection.
This is where we thought machine learning could potentially come in. We wanted to explore whether we could train a machine learning model to extend the metadata created through crowdsourcing to the whole collection. In particular, we wanted to test whether a machine learning model could detect whether a book was fiction or non-fiction just by looking at the title.
The short answer is that we have been partially successful in getting a machine learning model to predict the correct genre of books from the title.
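To give a flavour of the kind of approach involved, here is a minimal baseline sketch: a TF-IDF bag-of-words representation of titles fed into a logistic regression classifier, built with scikit-learn. This is an illustration only, not the model from the tutorials; the example titles and labels below are invented for demonstration.

```python
# Illustrative baseline only (not the Living with Machines model):
# classify book titles as fiction or non-fiction using TF-IDF features
# and logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: invented titles for illustration only.
titles = [
    "A History of the County of Yorkshire",
    "Practical Treatise on Mineralogy",
    "Annual Report of the Board of Trade",
    "Lectures on the Principles of Chemistry",
    "The Mystery of the Moorland Manor: A Novel",
    "Tales of the Highland Glens",
    "The Adventures of a Wandering Heir",
    "Poems and Ballads of the Border",
]
labels = ["non-fiction"] * 4 + ["fiction"] * 4

# Pipeline: turn each title into word/bigram TF-IDF features,
# then fit a linear classifier on those features.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(titles, labels)

# Predict the genre of a previously unseen title.
prediction = model.predict(["A Treatise on Natural Philosophy"])[0]
print(prediction)
```

In practice a real model needs far more training data than this, and title-only signals are inherently ambiguous, which is one reason the results described above are only partially successful.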
We have also created a series of tutorials that cover the steps involved in making this model. These allow others to investigate whether our approach is sensible, and potentially to learn from and build on our strategy in their own work. If you have any feedback on these tutorials, we would love to hear from you! You can open an issue on the GitHub repository for the book, send us a tweet @livingwmachines, or use the contact form.
You can play with the current model here: https://huggingface.co/spaces/BritishLibraryLabs/British-Library-books-genre-classifier-v2 and view the metadata for the collection with additional crowdsourced metadata on the British Library research repository.
Credits and Thanks
The Jupyter Book tutorial was developed by Daniel van Strien, Giorgia Tolfo, Victoria Morris and Kaspar Beelen. Thank you to everyone who contributed to the crowdsourcing task. You can find the names of people who were happy to be credited on the British Library’s Research Repository page for the annotated dataset: https://doi.org/10.23636/BKHQ-0312