Trove newspapers

Trove API access restrictions

With the cancellation of my Trove API keys by the National Library of Australia, I've made the difficult decision to stop work on Trove and archive all related code repositories.

The Trove sections of the GLAM Workbench will remain online, but they won't be updated. Everything here is openly licensed, so feel free to take what’s useful and develop it further yourself.

Given the fact that the NLA is willing to change the API terms of use to restrict access without any consultation, provides no transparency around acceptable use of full text content, and is willing to cancel API keys without warning, I can no longer recommend Trove as a reliable source for digital research.

Current version: v2.0.0 ¶

Assorted experiments and examples working with Trove’s digitised newspapers.

See below for information on running these notebooks in a live computing environment. Or just take them for a spin using Binder.

Trove newspapers in context¶

Notebooks in this section look at the Trove newspaper corpus as a whole, to try and understand what's there, and what's not.

Visualise the total number of newspaper articles in Trove by year & state ¶

Trove currently includes more than 200 million digitised newspaper articles published between 1803 and 2015. In this notebook we explore how those newspaper articles are distributed over time, and by state.

Analyse rates of OCR correction ¶

The full text of newspaper articles in Trove is extracted from page images using Optical Character Recognition (OCR). The accuracy of the OCR process is influenced by a range of factors including the font and the quality of the images. Many errors slip through. Volunteers have done a remarkable job in correcting these errors, but it's a huge task. This notebook explores the scale of OCR correction in Trove.

Finding non-English newspapers in Trove ¶

There are a growing number of non-English newspapers digitised in Trove. However, if you're only searching using English keywords, you might never know that they're there. I thought it would be useful to generate a list of non-English newspapers, but it wasn't quite as straightforward as I thought.

Beyond the copyright cliff of death ¶

Most of the newspaper articles on Trove were published before 1955, but there are some from the later period. Let's find out how many, and which newspapers they were published in.

Gathering historical data about the addition of newspaper titles to Trove ¶

The number of digitised newspapers available through Trove has increased dramatically since 2009. Understanding when newspapers were added is important for historiographical purposes, but there's no data about this available directly from Trove. This notebook uses web archives to extract lists of newspapers in Trove over time, and chart Trove's development.

Visualising searches¶

Notebooks in this section demonstrate some ways of visualising searches in Trove newspapers – seeing everything rather than just a list of search results.

QueryPic ¶

This is the latest iteration of QueryPic with many new features. Use it to visualise searches in Trove's newspapers and gazettes, aggregating the number of results by day, month, or year. Simply copy and paste a url from a Trove web search to get started. QueryPic's charts help you explore patterns and trends, and if you find something interesting you can click on a point to view the results in Trove for that time period.
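
As a rough illustration of what aggregating by day, month, or year means, here's a minimal sketch that rolls a set of daily result counts up into month or year totals. This is not QueryPic's own code: the `aggregate` function and the sample counts are made up for the example, and it assumes you already have a count of results for each date.

```python
from collections import Counter
from datetime import date


def aggregate(daily_counts, interval="year"):
    """Roll daily result counts up into day, month, or year totals."""
    # Map each interval name to the date format used as the grouping key
    formats = {"day": "%Y-%m-%d", "month": "%Y-%m", "year": "%Y"}
    totals = Counter()
    for day, count in daily_counts.items():
        totals[day.strftime(formats[interval])] += count
    # Sort by key so the totals are in chronological order for charting
    return dict(sorted(totals.items()))


# Hypothetical sample data: number of matching articles per day
sample = {
    date(1914, 8, 4): 12,
    date(1914, 8, 5): 30,
    date(1914, 9, 1): 7,
    date(1915, 4, 25): 3,
}

by_month = aggregate(sample, "month")
by_year = aggregate(sample, "year")
print(by_month)
print(by_year)
```

Changing the interval only changes the grouping key, so the same data can be charted at different levels of detail.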

QueryPic Deconstructed ¶

This is an older version of QueryPic that lets you build queries using keywords, states, or newspapers.

Visualise Trove newspaper searches over time ¶

This notebook helps you zoom out and explore how the number of Trove newspaper articles in your search results varies over time by using the decade and year facets. We then combine this approach with other search facets to see how we can slice a set of results up in different ways to investigate historical changes.

Map Trove newspaper results by state ¶

Uses the Trove state facet to create a choropleth map that visualises the number of search results per state.

Map Trove newspaper results by place of publication ¶

Uses the Trove title facet to find the number of results per newspaper, then merges the results with a dataset of geolocated newspapers to map where articles were published.
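
The merge step might look something like the sketch below. It assumes the facet results have already been reduced to a dictionary of counts keyed by title id, and that the geolocated dataset can be looked up by the same id. The function name, field names, ids, and coordinates are all invented for the example, not taken from the notebook.

```python
def merge_counts_with_places(title_counts, title_places):
    """Attach place details to the result count for each newspaper title.

    Titles with no known location are left out, since they can't be mapped.
    """
    merged = []
    for title_id, count in title_counts.items():
        place = title_places.get(title_id)
        if place is None:
            continue  # no location available for this title
        merged.append({"title_id": title_id, "count": count, **place})
    return merged


# Hypothetical results per title (as if taken from the title facet)
title_counts = {"t1": 120, "t2": 45, "t3": 8}

# Hypothetical geolocated titles; "t3" is deliberately missing
title_places = {
    "t1": {"place": "Sydney", "lat": -33.87, "lon": 151.21},
    "t2": {"place": "Hobart", "lat": -42.88, "lon": 147.33},
}

merged = merge_counts_with_places(title_counts, title_places)
for row in merged:
    print(row)
```

Each merged row has a count and a pair of coordinates, which is the shape most mapping libraries expect.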

Map Trove newspaper results by place of publication over time ¶

Adds a time dimension to the examples in the previous notebook to create an animated heatmap.

Harvesting data¶

Notebooks in this section help you harvest data relating to Trove's newspapers. To harvest all the newspaper articles from a search, see the Trove Newspaper and Gazette Harvester.

Harvest information about newspaper issues ¶

When you search Trove's newspapers, you find articles – these articles are grouped by page, and all the pages from a particular date make up an issue. But how do you find out what issues are available? On what dates were newspapers published? This notebook shows how you can get information about issues from the Trove API.

Harvest the issues of a newspaper as PDFs ¶

This notebook harvests issues of a newspaper as PDFs – one PDF per issue. If the newspaper has a long print run, this will consume large amounts of time and disk space, so you might want to limit your harvest by date range.

Harvest Australian Women's Weekly covers (or the front pages of any newspaper)¶

Somewhat confusingly, the Australian Women's Weekly is in with Trove's digitised newspapers and not the rest of the magazines. There are notebooks in the GLAM Workbench's journals section to help harvest all of a journal's covers as images, so I thought I should do the same for the Weekly. This notebook can be easily adjusted to download the front pages of any digitised newspaper.

Useful tools¶

Notebooks in this section provide useful tools that extend or enhance the Trove web interface and API.

Save a Trove newspaper article as an image ¶

Sometimes you want to be able to save a Trove newspaper article as an image. Unfortunately, the Trove web interface doesn't make this easy. The 'Download JPG' option actually loads an HTML page, and while you could individually save the images embedded in the HTML page, often articles are sliced up in ways that make the whole thing hard to read and use. This notebook grabs the page on which an article was published, and then crops the page image to the boundaries of the article. The result is a complete, intact image which presents the article as it was originally published. And if the article is split across multiple pages, you'll get one image per page.
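
To give a sense of the cropping step, here's a minimal sketch that works out one crop box per page from the separate zones that make up an article. It assumes each zone is described by a page id and its left, top, right, and bottom pixel coordinates. That structure, the function name, and the sample values are assumptions made for the example, not the notebook's actual data format.

```python
def article_crop_boxes(zones):
    """Find the smallest box on each page that encloses all of an article's zones.

    Each zone is a (page_id, left, top, right, bottom) tuple. The boxes are
    returned as (left, upper, right, lower) tuples, the order that Pillow's
    Image.crop() expects.
    """
    boxes = {}
    for page_id, left, top, right, bottom in zones:
        if page_id in boxes:
            l, t, r, b = boxes[page_id]
            # Grow the existing box so it also covers this zone
            boxes[page_id] = (min(l, left), min(t, top), max(r, right), max(b, bottom))
        else:
            boxes[page_id] = (left, top, right, bottom)
    return boxes


# Hypothetical article: two columns on one page, continued on a second page
zones = [
    ("page-1", 100, 200, 400, 900),
    ("page-1", 400, 250, 700, 600),
    ("page-2", 50, 60, 300, 500),
]

boxes = article_crop_boxes(zones)
for page_id, box in boxes.items():
    print(page_id, box)
```

Because the boxes are grouped by page, an article split across two pages produces two crop boxes, and so two images.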

Save Trove newspaper article as image web app ¶

A simple web app that helps you save a Trove newspaper article as an image.

Download a page image ¶

The Trove web interface doesn’t provide a way of getting high-resolution page images from newspapers. This simple app lets you download page images as complete, high-resolution JPG files.

Generate an article thumbnail ¶

Generate a nice square thumbnail image for a newspaper article.

Upload Trove newspaper articles to Omeka-S ¶

This notebook steps through the process of uploading Trove newspaper articles to your own Omeka-S instance via the API. As well as uploading the article metadata, it attaches image(s) and PDFs of the articles, and creates a linked record for the publishing newspaper. The source of the articles can be a Trove search, a Trove list, a Zotero collection, or just a list of article ids.

Tips and tricks¶

Notebooks in this section provide some useful hints to use with the Trove API.

Today’s news yesterday ¶

Uses the date index and the firstpageseq parameter to find articles from exactly 100 years ago that were published on the front page. It then selects one of the articles at random and downloads and displays an image of the front page.
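
Here's a minimal sketch of how a query like this could be put together. The `date:` and `firstpageseq:` index names come from the description above, but the exact range syntax, including starting the range on the previous day, is an assumption that should be checked against the Trove API documentation. The function names are invented for the example.

```python
from datetime import date, timedelta


def century_ago(today):
    """Return the same calendar day 100 years earlier."""
    try:
        return today.replace(year=today.year - 100)
    except ValueError:
        # 29 February when the year 100 years earlier wasn't a leap year
        return today.replace(year=today.year - 100, day=28)


def front_page_query(today):
    """Build a search query for front-page articles from 100 years ago."""
    target = century_ago(today)
    day_before = target - timedelta(days=1)
    # Assumed syntax: a date range from the previous day to the target day
    date_range = f"date:[{day_before.isoformat()}T00:00:00Z TO {target.isoformat()}T00:00:00Z]"
    # firstpageseq:1 limits results to articles that start on page 1
    return f"{date_range} firstpageseq:1"


query = front_page_query(date(2024, 5, 20))
print(query)
```

The resulting string would be supplied as the search query in an API request. Picking one article at random from the results and downloading the page image are separate steps not shown here.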

Create a Trove OCR corrections ticker ¶

Uses the has:corrections parameter to get the total number of newspaper articles with OCR corrections, then displays the results, updating every five seconds.

Get the page coordinates of a digitised newspaper article from Trove ¶

This notebook demonstrates how to find the coordinates of a newspaper article on a digitised page.

Get creative¶

Notebooks in this section look at ways you can use data from Trove newspapers in creative ways.

Make composite images from lots of Trove newspaper thumbnails ¶

This notebook starts with a search in Trove's newspapers. It uses the Trove API to work its way through the search results. For each article it creates a thumbnail image using the code from this notebook. Once this first stage is finished, you have a directory full of lots of thumbnails. The next stage takes all those thumbnails and pastes them one by one into a BIG image to create a composite, or mosaic.
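
The pasting stage comes down to working out where each thumbnail goes in a grid. Here's a minimal sketch of that calculation, assuming square thumbnails of a fixed size laid out row by row. The function name and the numbers are made up for the example, and the notebook itself may arrange things differently.

```python
def paste_positions(n_thumbs, thumb_size, columns):
    """Work out the canvas size and the top-left corner for each thumbnail.

    Thumbnails fill the grid from left to right, then top to bottom.
    """
    positions = [
        ((i % columns) * thumb_size, (i // columns) * thumb_size)
        for i in range(n_thumbs)
    ]
    # Round up so a partly filled last row still fits on the canvas
    rows = -(-n_thumbs // columns)
    canvas_size = (columns * thumb_size, rows * thumb_size)
    return canvas_size, positions


# Hypothetical example: five 200px thumbnails in a three-column grid
canvas_size, positions = paste_positions(5, 200, 3)
print(canvas_size)
print(positions)
```

With an image library such as Pillow, each position could be passed to `Image.paste()` to place a thumbnail on a blank canvas of the calculated size.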

Create 'scissors and paste' messages from Trove newspaper articles ¶

When you search for a term in Trove's digitised newspapers and click on an individual article, you'll see your search terms are highlighted. If you look at the code you'll see the highlighted box around the word includes its page coordinates. That means that if we search for a word, we can find where it appears on a page, and by cropping the page to those coordinates we can create an image of an individual word. By combining these images we can create 'scissors and paste' style messages!

Create large composite images from snipped words ¶

This is a variation of the 'scissors & paste' notebook that extracts words from Trove newspaper images and compiles them into messages. In this notebook, you can harvest multiple versions of a list of words and compile them all into one big image.

Data and images¶

Data about newspaper issues ¶

This dataset includes information about digitised newspaper issues, including the total number of issues per newspaper/year, and a complete list of issues, identifiers and dates for every digitised newspaper in Trove.

Newspaper titles harvested from web archives ¶

The number of digitised newspapers available through Trove has increased dramatically since 2009. Understanding when newspapers were added is important for historiographical purposes, but there's no data about this available directly from Trove. These datasets were created by harvesting information about newspaper titles in Trove from web archives.

Australian Women's Weekly issues and front covers, 1933 to 1982 ¶

This dataset contains metadata and front cover images of 2,566 Australian Women's Weekly issues on Trove published between 1933 and 1982.

Trove newspapers with non-English language content ¶

This dataset contains information about newspapers published in languages other than English that have been digitised and made available through Trove. Data about the languages present in newspapers was generated by harvesting a sample of articles from each newspaper using the Trove API, and then using language detection software on the OCR'd text of each article.

Trove newspapers with articles published after 1954 ¶

CSV formatted dataset containing a list of digitised newspapers in Trove with articles published after 1954 (the copyright cliff of death).

OCR corrections in Trove newspapers ¶

OCR errors in Trove's digitised newspapers can be corrected by users. To help understand patterns in newspaper correction, this dataset has been created to record information about the number of articles with corrections.

Run these notebooks¶

There are a number of different ways to use these notebooks. Binder is quickest and easiest, but it doesn't save your data. I've listed the options below from easiest to most complicated (requiring more technical knowledge).

Using ARDC Binder¶

Click on the button above to launch the notebooks in this repository using the ARDC Binder service. This is a free service available to researchers in Australian universities. You'll be asked to log in with your university credentials. Note that sessions will close if you stop using the notebooks, and no data will be preserved. Make sure you download any changed notebooks or harvested data that you want to save.

See Using ARDC Binder for more details.

Using Binder¶

Click on the button above to launch the notebooks in this repository using the Binder service (it might take a little while to load). This is a free service, but note that sessions will close if you stop using the notebooks, and no data will be saved. Make sure you download any changed notebooks or harvested data that you want to save.

See Using Binder for more details.

Using Reclaim Cloud¶

Reclaim Cloud is a paid hosting service, aimed particularly at supporting digital scholarship in the humanities. Unlike Binder, the environments you create on Reclaim Cloud will save your data – even if you switch them off! To run this repository on Reclaim Cloud for the first time:

Create a Reclaim Cloud account and log in.
Click on the button above to start the installation process.
A dialogue box will ask you to set a password. This is used to limit access to your Jupyter installation.
Sit back and wait for the installation to complete!
Once the installation is finished click on the 'Open in Browser' button of your newly created environment (note that you might need to wait a few minutes before everything is ready).

See Using Reclaim Cloud for more details.

Running in a container on your own computer¶

GLAM Workbench repositories are stored as pre-built container images on quay.io. You can run these containers on your own computer to set up a virtual machine with everything you need to use the notebooks. This is free, but requires more technical knowledge – you'll have to install Podman on your computer, and be able to use the command line.

Install Podman.

In a terminal, run the following command:

podman run --rm -p 8888:8888 quay.io/glamworkbench/trove-newspapers jupyter lab --ip=0.0.0.0 --port=8888 --ServerApp.token="" --LabApp.default_url="/lab/tree/index.ipynb"

It will take a while to download and configure the container image. Once it's ready you'll see a message saying that Jupyter Notebook is running.
Point your web browser to http://127.0.0.1:8888
When you've finished, download any files or data you want to keep from Jupyter Lab, and enter Ctrl+C in the terminal.

See Running in a container on your own computer for more details.

Setting up on your own computer¶

If you know your way around the command line and are comfortable installing software, you might want to set up your own computer to run these notebooks. You'll need to have recent versions of Python and Git installed. I use pyenv, pyenv-virtualenv, and pip-tools to create and manage Python versions and environments.

In a terminal:

Create a Python virtual environment (Python >= 3.10 should be ok): pyenv virtualenv 3.10.12 trove-newspapers
Activate the virtual environment: pyenv local trove-newspapers
Use git clone to create a local version of the GLAM Workbench repository: git clone https://github.com/GLAM-Workbench/trove-newspapers.git
Use cd to move into the newly-cloned folder: cd trove-newspapers
Run pip install pip-tools to install pip-tools.
Run pip-sync requirements.txt dev-requirements.txt to install the required Python packages.
Start Jupyter with jupyter lab – a browser window should open automatically. If not, copy and paste the url from the command line to your web browser.
To shut down your Jupyter Lab session enter Ctrl+C in the terminal.

See Using Python on your own computer for more details.

Contributors¶

Tim Sherratt

Cite as¶

Sherratt, Tim. (2024). GLAM-Workbench/trove-newspapers (version v2.0.0). Zenodo. https://doi.org/10.5281/zenodo.4724339