A quick tour of two counties

Written by David Beavan, August 22, 2019

Our initial newspaper data comes from the British Newspaper Archive, thanks to the British Library and FindMyPast. We initially sampled two contrasting counties: Lancashire in North West England (industrialised), and Dorset in South West England (much less industrialised). Here’s a quick tour of the data we have (treating both counties as one sample):

Total number of newspaper publications: 64
Total word count (millions): 6360.122871

That’s a whopping six thousand million words of content, which, depending on your definition of billion, might be six billion words. Or not. What is telling is the slow ramp-up in the number of publications and, with it, their word count. This means not all years are equal, and during the project we must statistically normalise (BTW bonus retro web content) the corpus if we are to compare years in any like-for-like way.
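To make the idea of normalising concrete, here is a minimal sketch of one common approach: turning a raw count into a rate per million words for each year. The figures and the names `words_per_year`, `term_hits_per_year` and `per_million` are invented for illustration and are not taken from our corpus or our code.

```python
# Hypothetical per-year figures, invented purely for illustration.
words_per_year = {1800: 2_000_000, 1850: 40_000_000, 1900: 150_000_000}
term_hits_per_year = {1800: 30, 1850: 900, 1900: 2_700}


def per_million(hits, total_words):
    """Return occurrences per million words for each year."""
    return {
        year: hits[year] / total_words[year] * 1_000_000
        for year in hits
        if total_words.get(year)  # skip years with no recorded text
    }


rates = per_million(term_hits_per_year, words_per_year)
for year, rate in sorted(rates.items()):
    print(year, round(rate, 1))
```

In this made-up sample the raw hits climb from 30 to 2,700, yet the rate per million words only moves between 15 and 22.5, which is the kind of like-for-like view that raw counts hide.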

Let’s dig deeper. The newspaper pages have been segmented into a number of different areas prior to LwM receiving them. Are these classifications useful, for example to separate articles (full of news) from adverts (full of repeating not-news)?

Total article count (millions): 10.029062

That doesn’t seem right. The article type dominates, to the extent that I wouldn’t trust this data set, as the other lines barely move from the horizontal. Surely there is more to adverts (and other content) than that? Well, actually there is:

Graph of word counts of articles over time

Table of average word counts by article type

This looks more like it, so we have a (comparatively) small number of wordy adverts and a large number of concise articles. Let’s see what that word count looks like if we do normalise the corpus, and instead of looking at raw numbers, we look at the proportion of each type:

Graph of proportion of word counts of article types over time
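As a rough sketch of what "proportion of each type" means in practice, the snippet below divides each type's word count by the year's total. The rows, the type labels and the function name `type_shares` are assumptions made up for this example, not our actual data or pipeline.

```python
from collections import defaultdict

# Hypothetical rows of (year, item type, word count); invented for illustration.
rows = [
    (1850, "article", 600_000),
    (1850, "advert", 400_000),
    (1900, "article", 4_500_000),
    (1900, "advert", 3_000_000),
    (1900, "other", 0),
]


def type_shares(rows):
    """Return each type's share of the total word count, year by year."""
    totals = defaultdict(int)
    by_type = defaultdict(lambda: defaultdict(int))
    for year, item_type, words in rows:
        totals[year] += words
        by_type[year][item_type] += words
    return {
        year: {t: w / totals[year] for t, w in types.items()}
        for year, types in by_type.items()
        if totals[year] > 0  # a year with no words has no meaningful shares
    }


shares = type_shares(rows)
for year in sorted(shares):
    print(year, shares[year])
```

The two invented years differ hugely in raw volume but split between articles and adverts in the same proportion, so their lines would sit level on a proportional graph.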

There’s still something going on between 1780 and 1830, but from then it’s pretty stable. A quick glance at the raw data makes me think we’re just short of samples: many of those early years have at most three publications making up the data, and we know from the graphs their word count is low. The average length of the article type declines fairly steadily, from 2,000 words to 500 words across the 135 years (it still drops during the stable data years from 1830):

Graph of average word count of article types over time

Graph of average word count of articles over time (minus advertisements)
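An average word count like the ones graphed above is simply total words divided by number of items, per year and type. Here is a minimal sketch, with a guard that drops thinly sampled groups, since the early years are short of samples. The data, the `min_items` threshold and the function name `mean_length` are hypothetical choices for illustration only.

```python
from collections import defaultdict

# Hypothetical rows of (year, item type, word count of one item).
items = [
    (1790, "article", 2_100),
    (1850, "article", 900),
    (1850, "article", 1_100),
    (1850, "advert", 3_000),
    (1850, "advert", 5_000),
]


def mean_length(items, min_items=2):
    """Average words per item for each (year, type) with enough samples."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for year, item_type, words in items:
        key = (year, item_type)
        sums[key] += words
        counts[key] += 1
    return {
        key: sums[key] / counts[key]
        for key in sums
        if counts[key] >= min_items  # ignore groups too small to trust
    }


averages = mean_length(items)
for key in sorted(averages):
    print(key, averages[key])
```

With the threshold set to two, the lone 1790 item is left out rather than being plotted as if it were a reliable average.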

The exploration here demonstrates how important it is to fully understand the makeup and materiality of our content, and the various biases of our data as it moves from physical to digital. It also shows how distant and closer views can illuminate different perspectives and often pose more questions than they answer. Do share your thoughts and observations on the data, the methods, and what you see.
