Collecting annotations from British Library staff

Annotation | OCR

Written by Daniel van Strien, September 23, 2019

Like many projects working with digitised historic materials, the historians and data scientists on the Living with Machines project face challenges in working with text produced through Optical Character Recognition (OCR). Some of the project researchers are trying to quantify the impact of OCR quality issues on Natural Language Processing (NLP) tasks. In plain English, these include tasks like labelling parts of a sentence for semantic analysis or identifying the names of people, organisations, places and concepts mentioned in digitised texts.

During the summer we held an open annotation session, inviting British Library staff members to help with our experiments. We planned four different linguistic annotation tasks (named entity recognition, recognition of machines, entity linking to Wikipedia, and semantic role labelling) on newspaper articles from the nineteenth century.

The annotators were encouraged to try the different tasks and to voice any doubts they encountered. This proved invaluable in redefining the task boundaries where they were unclear. The session was also a useful test of our annotation guidelines and a chance to see whether our internal documentation was fit for slightly more public consumption.

We also got feedback that one of our annotation labels, ‘machine object’, was sometimes difficult to apply because the concept is so broad. This ties back to ongoing discussions within the project about what ‘counts’ as a machine and whether we should focus on historical or contemporary understandings of the concept. Ambiguities like these highlight how carrying out annotations helps you to better understand your source material and to judge whether your proposed downstream computational tasks match it.

Figure: Screenshot of the Inception annotation tool with annotation labels for location entities in a 19th-century newspaper article.

Feedback on Inception

A side benefit of hosting the session was that it gave us the opportunity to test our chosen annotation tool, ‘Inception’, with a fairly large group of users. Inception is an open source tool for linguistic annotation. We had been using the tool for internal annotations but were keen to get feedback on it from British Library staff.

After the session was complete we asked for feedback on the tool and the tasks carried out in the session. Most respondents agreed that the tool was accessible and fast, though there were a few challenges in using it for our task. One of the main requests was to have the image of the newspaper article available alongside the OCR text, particularly for the examples with poor OCR. This is not surprising given how difficult those examples are to read.

We hope to host more sessions focused on text annotation soon. We are also keen to branch out into annotations of visual material, which will build on work done as part of the Space and Time Lab.

