Living with Machines OCR hack
The Living with Machines project will draw on maps, newspapers, press directories, census records and a wide range of other primary sources. A common theme across these sources is that we will be relying on OCR to extract text from the original source material in order to make this information available for computational analysis.
What is OCR and what are the challenges of working with OCR’d material?
OCR (optical character recognition) is the process of turning images of text into computer-readable text. For example, an image of a line of printed text should be recognised as ‘The PRODUCE OF THE BRITISH MINES.’ by the OCR software.
OCR has been an ongoing research area since the early days of computing. Though great improvements have been made in the accuracy of OCR, it is not a solved problem. Although computers often achieve a high level of accuracy (i.e. over 95%) on modern, well-formatted text, they can struggle to recognise historical fonts. The quality of the OCR can be affected by the quality of the digitised image, the condition of the original item, the formatting of the item, the typography used and the software used to produce the OCR.
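To make a figure like ‘95% accuracy’ concrete, here is a minimal sketch of one common way to score OCR output against a hand-corrected transcription: character accuracy based on edit distance. This is an assumption for illustration only; the text above does not say how the accuracy figure is measured, and the function names `levenshtein` and `character_accuracy` are our own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def character_accuracy(truth: str, ocr: str) -> float:
    """1 minus the character error rate, clamped at 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1 - levenshtein(truth, ocr) / len(truth))


# Illustrative strings only: a correct word and a plausible OCR misreading.
print(character_accuracy("machine", "mach1n£"))
```

A score like this needs a trusted transcription to compare against, which is usually only available for a small sample of pages.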
Working with text produced through OCR presents challenges. If words are not properly recognised, they won’t appear in searches for those words. For example, if we want to find all the uses of the word ‘machine’ and it has been recognised as ‘mach1n£’ by the OCR software, our search won’t find a match even though the term appeared in the original source material. Across a whole corpus, errors of this kind may skew our results: if we wanted to explore mentions of a particular type of machine, or place, across our newspaper corpus, OCR errors could distort the counts. This skew may be influenced by the year the newspaper was produced (assuming the quality of the original items correlates with age), or it may vary more unevenly depending on a range of other factors.
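The search problem described above can be sketched in a few lines. The example below is illustrative only: the sample sentence is made up, the helper names `exact_hits` and `fuzzy_hits` are hypothetical, and approximate matching with Python's standard `difflib` module is just one possible workaround, not something the project has adopted.

```python
import difflib


def exact_hits(tokens, term):
    """Tokens that match the search term exactly (ignoring case)."""
    return [t for t in tokens if t.lower() == term.lower()]


def fuzzy_hits(tokens, term, cutoff=0.7):
    """Tokens that are merely similar to the search term.

    The cutoff of 0.7 is an arbitrary choice for this example; a lower
    value finds more OCR variants but also more false matches.
    """
    return difflib.get_close_matches(term, tokens, n=len(tokens), cutoff=cutoff)


# Made-up OCR output in which 'machine' was misread.
tokens = "the new mach1n£ was installed at the mill".split()

print(exact_hits(tokens, "machine"))
print(fuzzy_hits(tokens, "machine"))
```

The exact search finds nothing in this sentence, while the approximate search should recover the misread token. The trade-off is that fuzzy matching can also pull in unrelated words, so it introduces its own kind of skew.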
Exploring the impact of OCR for Living with Machines
To begin to explore these questions, Living with Machines recently held a ‘hack week’. During this week we worked in groups to explore how OCR might affect our research questions. Some of the questions we explored during the week were:
- Whether a higher OCR confidence (the certainty the OCR software reports for the output it produces from an image) correlates with more of the OCR output matching words in a dictionary
- The impact of OCR quality on various research tasks
- How effectively OCR could be used to extract text from maps
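As a rough illustration of the first question, the sketch below computes the share of words on a page that appear in a dictionary and then the Pearson correlation between that share and the OCR confidence. Everything here is an assumption for illustration: the tiny dictionary, the page texts and the confidence values are invented, and the function names are our own rather than part of the hack week workflow.

```python
import math
import string


def dictionary_match_rate(text, dictionary):
    """Fraction of whitespace-separated words found in the dictionary."""
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented data: (OCR confidence, OCR text) for three hypothetical pages.
dictionary = {"the", "produce", "of", "british", "mines", "machine", "works"}
pages = [
    (0.95, "The produce of the British mines."),
    (0.80, "The mach1n£ works."),
    (0.60, "Tbe pr0duce 0f tbe mines"),
]

confidences = [conf for conf, _ in pages]
match_rates = [dictionary_match_rate(text, dictionary) for _, text in pages]
print(match_rates)
print(pearson(confidences, match_rates))
```

A dictionary match rate is only a proxy for OCR quality. Historical spellings, proper nouns and place names are often missing from a modern dictionary, so a low rate does not always mean poor OCR.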
Much of our hack week was spent setting up workflows for exploring these questions, and further work remains to be done. The team working on maps has written a post on their initial findings, and we hope to share more on OCR soon!