
Trove newspaper & gazette harvester

Current version: v2.0.1

Info

Now updated to work with version 3 of the Trove API!

The Trove Newspaper & Gazette Harvester makes it easy to download large quantities of digitised articles from Trove's newspapers and gazettes. Just give it a search from the Trove web interface, and the harvester will save the metadata of all the articles in a CSV (spreadsheet) file for further analysis. You can also save the full text of every article, as well as copies of the articles as JPG images, and even PDFs. While the web interface will only show you the first 2,000 results matching your search, the Newspaper & Gazette Harvester will get everything.

You can install the Harvester as a Python command line tool, or import it for use as a Python library. The notebooks below provide examples of the different ways you can use the Harvester and the data it creates.
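
For example, a minimal sketch of the library route (the prepare_query and Harvester interfaces shown here are assumptions based on the Trove Newspaper Harvester documentation – check there for the current details):

    # Sketch only – interface names are assumptions; see the library docs.
    from trove_newspaper_harvester.core import Harvester, prepare_query

    # Convert a Trove web interface search URL into API query parameters
    query_params = prepare_query(
        "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
    )

    # text=True saves the OCRd text of each article as well as the metadata
    harvester = Harvester(query_params=query_params, key="YOUR_API_KEY", text=True)
    harvester.harvest()
    harvester.save_csv()  # convert the ndjson results to CSV (see below)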

See below for information on running these notebooks in a live computing environment. Or just take them for a spin using Binder.


Harvesting search results

Trove Harvester web app

A simple web interface to the Trove Newspaper and Gazette Harvester – the easiest and quickest way to download all the results from a Trove newspaper or gazette search.

Using TroveHarvester to get newspaper and gazette articles in bulk

This notebook provides an introduction to the Trove Newspaper and Gazette Harvester command line tool. Edit a few cells and you'll be harvesting metadata and full text of thousands of articles in minutes. This gives you more control over your harvest than the simple web app, including the ability to restart a failed harvest.

Harvesting articles that mention "Anzac Day" on Anzac Day

This notebook provides an example of harvesting a complex search by using the trove_newspaper_harvester library directly. It shows how you can loop through a span of years to harvest results from a particular day each year.
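
A rough sketch of the pattern (the query parameters and date-range syntax here are assumptions – the notebook has the working code):

    # Sketch of harvesting one day across a span of years.
    # The date-range query syntax is an assumption – see the notebook.
    from trove_newspaper_harvester.core import Harvester

    API_KEY = "YOUR_API_KEY"

    for year in range(1916, 1931):
        # Search for "Anzac Day" on 25 April of each year
        query_params = {
            "category": "newspaper",
            "q": f'"Anzac Day" date:[{year}-04-25T00:00:00Z TO {year}-04-25T00:00:00Z]',
        }
        Harvester(query_params=query_params, key=API_KEY, text=True).harvest()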

Exploring your harvested data

Display the results of a harvest as a searchable database using Datasette

Datasette is 'a tool for exploring and publishing data'. Give it a CSV file and it turns it into a fully-searchable database, running in your browser. It supports facets, full-text search, and, with a bit of tweaking, can even present images. Although Datasette is a command-line tool, we can run it from within a Jupyter notebook, and open a new window to display the results. This notebook shows you how to load the newspaper data you've harvested into Datasette, and start it up. If you've also harvested full-text and images from the newspaper articles, you can add these to your database as well!
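
One way to do the loading step is to write the CSV into a SQLite database that Datasette can serve (file names here are hypothetical):

    # Load a harvested CSV into SQLite for use with Datasette.
    # The CSV path and database name are hypothetical.
    import sqlite3

    import pandas as pd

    df = pd.read_csv("data/20231026120000/results.csv")
    conn = sqlite3.connect("harvest.db")
    df.to_sql("results", conn, index=False, if_exists="replace")
    conn.close()
    # Then serve it from the command line: datasette harvest.db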

Exploring your TroveHarvester data

This notebook shows some ways in which you can analyse and visualise the article metadata you've harvested — show the distribution of articles over time and space; find which newspapers published the most articles. (Under construction)
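
For example, a couple of these summaries are easy to generate with pandas (the CSV path is hypothetical):

    # Explore harvested metadata with pandas – the path is hypothetical.
    import pandas as pd

    df = pd.read_csv("data/20231026120000/results.csv", parse_dates=["date"])

    # Distribution of articles over time
    print(df["date"].dt.year.value_counts().sort_index())

    # Newspapers that published the most matching articles
    print(df["newspaper_title"].value_counts().head(10))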

Exploring harvested text files

This notebook suggests some ways in which you can aggregate and analyse the individual OCRd text files for each article — look at word frequencies; calculate TF-IDF values. (Under construction)
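
As a starting point, something like this sketch would compute simple word frequencies and TF-IDF values (the text directory path is hypothetical, and scikit-learn is assumed for the TF-IDF step):

    # Aggregate and analyse the harvested OCRd text files.
    # The directory path is hypothetical; scikit-learn is assumed.
    from collections import Counter
    from pathlib import Path

    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = [p.read_text() for p in Path("data/20231026120000/text").glob("*.txt")]

    # Simple word frequencies across the whole corpus
    print(Counter(" ".join(texts).lower().split()).most_common(20))

    # TF-IDF values, one row per article
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)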

Your harvested data

When you start a new harvest, the harvester by default looks for a directory called data. Within this directory it creates another directory for your harvest. The name of this directory will reflect the current date/time. The harvester saves your results inside this directory.

Info

You can customise the name and location of the harvest directory using either the Harvester command-line tool, or using the Harvester as a Python library. See the Trove Newspaper Harvester documentation for full details. The Harvesting articles that mention "Anzac Day" on Anzac Day notebook provides an example of this.
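
For example, when using the library you might do something like this (the data_dir and harvest_dir parameter names are assumptions based on the documentation):

    # Sketch of customising where a harvest is saved – parameter names
    # are assumptions; check the library documentation.
    from trove_newspaper_harvester.core import Harvester, prepare_query

    harvester = Harvester(
        query_params=prepare_query(
            "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
        ),
        key="YOUR_API_KEY",
        data_dir="my_harvests",       # parent directory for all harvests
        harvest_dir="wragge-search",  # name of this harvest's directory
    )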

There will be at least three files created for each harvest:

  • harvester_config.json – a file that captures the parameters used to launch the harvest
  • ro-crate-metadata.json – a metadata file documenting the harvest in RO-Crate format
  • results.csv – details of all the harvested articles in a plain text CSV (Comma Separated Values) file. You can open it with any spreadsheet program.

The details recorded for each article are:

  • article_id – a unique identifier for the article
  • title – the title of the article
  • date – in ISO format, YYYY-MM-DD
  • page – page number (of course), but might also indicate the page is part of a supplement or special section
  • newspaper_id – a unique identifier for the newspaper or gazette title (this can be used to retrieve more information or build a link to the web interface)
  • newspaper_title – the name of the newspaper (or gazette)
  • category – one of ‘Article’, ‘Advertising’, ‘Detailed lists, results, guides’, ‘Family Notices’, or ‘Literature’
  • words – number of words in the article
  • illustrated – is it illustrated (values are y or n)
  • edition – edition of newspaper (rarely used)
  • supplement – section of newspaper (rarely used)
  • section – section of newspaper (rarely used)
  • url – the persistent url for the article
  • page_url – the persistent url of the page on which the article is published
  • snippet – short text sample
  • relevance – search relevance score of this result
  • status – some articles that are still being processed will have the status "coming soon" and might be missing other fields
  • corrections – number of text corrections
  • last_correction – date of last correction
  • tags – number of attached tags
  • comments – number of attached comments
  • lists – number of lists this article is included in
  • text – path to text file
  • pdf – path to PDF file
  • images – path to image file(s)
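
Once loaded into a dataframe, these columns make filtering straightforward (the CSV path is hypothetical):

    # Filter the harvested metadata using the columns described above.
    import pandas as pd

    df = pd.read_csv("data/20231026120000/results.csv")

    # Exclude articles that are still being processed
    complete = df[df["status"] != "coming soon"]

    # Illustrated articles of more than 1,000 words
    long_illustrated = complete[(complete["illustrated"] == "y") & (complete["words"] > 1000)]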

Info

If you use the harvester as a Python library, the metadata will be saved in a newline delimited JSON file (one JSON object per line) named results.ndjson, rather than results.csv. You can convert the ndjson file to CSV using the Harvester.save_csv() method. The results.ndjson stores the API results from Trove as is, with a couple of exceptions:

  • if the text parameter has been set to True, the articleText field will contain the path to a .txt file containing the OCRd text contents of the article (rather than containing the text itself)
  • similarly, if PDFs and images are requested, the pdf and image fields in the ndjson file will point to the saved files.
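
Reading the ndjson file back into Python is a one-liner per record (the path is hypothetical, and field names follow the Trove API):

    # Load ndjson results – one JSON object per line. Path is hypothetical.
    import json

    with open("data/20231026120000/results.ndjson") as f:
        articles = [json.loads(line) for line in f]

    print(articles[0].get("id"), articles[0].get("heading"))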

If you’ve asked for text files, PDFs, or images, there will be additional directories containing those files. Files containing the OCRd text of the articles will be saved in a directory named text. These are just plain text files, stripped of any HTML. These files include some basic metadata in their file names – the date of the article, the id number of the newspaper, and the id number of the article. So, for example, the filename 19460104-1002-206680758.txt tells you:

  • 19460104 – the article was published on 4 January 1946
  • 1002 – the id of the newspaper is 1002
  • 206680758 – the id of the article is 206680758
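
Splitting a filename back into these parts is simple (using the example filename from above):

    # Unpack a harvested text filename into its metadata parts.
    from pathlib import Path

    date_str, newspaper_id, article_id = Path("19460104-1002-206680758.txt").stem.split("-")
    print(date_str, newspaper_id, article_id)  # 19460104 1002 206680758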

As you can see, you can use the newspaper and article ids to create direct links into Trove:

  • to a newspaper or gazette – https://trove.nla.gov.au/newspaper/title/[newspaper id]
  • to an article – http://nla.gov.au/nla.news-article[article id]
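
For example, a couple of hypothetical helper functions following these patterns:

    # Hypothetical helpers that build Trove links from harvested ids.
    def newspaper_url(newspaper_id):
        return f"https://trove.nla.gov.au/newspaper/title/{newspaper_id}"

    def article_url(article_id):
        return f"http://nla.gov.au/nla.news-article{article_id}"

    print(article_url("206680758"))  # http://nla.gov.au/nla.news-article206680758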

Similarly, if you've asked for copies of the articles as images, they'll be in a directory named image. The image file names are similar to the text files, but with an extra id number for the page from which the image was extracted. So, for example, the image filename 19250411-460-140772994-11900413.jpg tells you:

  • 19250411 – the article was published on 11 April 1925
  • 460 – the id of the newspaper is 460
  • 140772994 – the id of the article is 140772994
  • 11900413 – the id of the page is 11900413

Once you have your data you can start exploring! You'll find some Jupyter notebooks above that provide examples of analysing and visualising both the metadata and the full text.

Run these notebooks

There are a number of different ways to use these notebooks. Binder is quickest and easiest, but it doesn't save your data. I've listed the options below from easiest to most complicated (requiring more technical knowledge).

Using ARDC Binder

Launch on ARDC Binder

Click on the button above to launch the notebooks in this repository using the ARDC Binder service. This is a free service available to researchers in Australian universities. You'll be asked to log in with your university credentials. Note that sessions will close if you stop using the notebooks, and no data will be preserved. Make sure you download any changed notebooks or harvested data that you want to save.

See Using ARDC Binder for more details.

Using Binder

Launch on Binder

Click on the button above to launch the notebooks in this repository using the Binder service (it might take a little while to load). This is a free service, but note that sessions will close if you stop using the notebooks, and no data will be saved. Make sure you download any changed notebooks or harvested data that you want to save.

See Using Binder for more details.

Using Reclaim Cloud

Launch on Reclaim Cloud

Reclaim Cloud is a paid hosting service, aimed particularly at supporting digital scholarship in the humanities. Unlike Binder, the environments you create on Reclaim Cloud will save your data – even if you switch them off! To run this repository on Reclaim Cloud for the first time:

  • Create a Reclaim Cloud account and log in.
  • Click on the button above to start the installation process.
  • A dialogue box will ask you to set a password; this is used to limit access to your Jupyter installation.
  • Sit back and wait for the installation to complete!
  • Once the installation is finished click on the 'Open in Browser' button of your newly created environment (note that you might need to wait a few minutes before everything is ready).

See Using Reclaim Cloud for more details.

Using the Nectar Cloud


The Nectar Research Cloud (part of the Australian Research Data Commons) provides cloud computing services to researchers in Australian and New Zealand universities. Any university-affiliated researcher can log on to Nectar and receive up to 6 months of free cloud computing time. And if you need more, you can apply for a specific project allocation.

The GLAM Workbench is available in the Nectar Cloud as a pre-configured application. This means you can get it up and going without worrying about the technical infrastructure – just fill in a few details and you're away! To create an instance of this repository in the Nectar Cloud:

  • Log in to the Nectar Dashboard using your university credentials.
  • From the Dashboard choose Applications -> Browse Local.
  • Enter 'GLAM' in the filter box and hit Enter; you should see the GLAM Workbench application.
  • Click on the GLAM Workbench application's Quick Deploy button.
  • Step through the various configuration options. Some options are only available if you have a dedicated project allocation.
  • When asked to select a GLAM Workbench repository, choose 'Trove newspaper & gazette harvester' from the dropdown list.
  • Complete the configuration and deploy your GLAM Workbench instance.
  • The URL to access your instance will be displayed once it's ready. Click on the URL!

See Using Nectar for more information.

Using Docker or Podman

You can use Docker or Podman to run a pre-built computing environment on your own computer. It will set up everything you need to run the notebooks in this repository. This is free, but requires more technical knowledge – you'll have to install Docker or Podman on your computer, and be able to use the command line.

  • Install Docker Desktop or Podman.
  • From the command line, run the following command for Docker:
    docker run -p 8888:8888 --name trove-newspaper-harvester quay.io/glamworkbench/trove-newspaper-harvester repo2docker-entrypoint jupyter lab --ip 0.0.0.0 --ServerApp.token=''
    
  • Or for Podman:
    podman run -p 8888:8888 --name trove-newspaper-harvester quay.io/glamworkbench/trove-newspaper-harvester repo2docker-entrypoint jupyter lab --ip 0.0.0.0 --ServerApp.token=''
    
  • It will take a while to download and configure the container image. Once it's ready you'll see a message saying that Jupyter is running.
  • Point your web browser to http://127.0.0.1:8888

See Using Docker for more details.


Cite as

Sherratt, Tim. (2023). GLAM-Workbench/trove-newspaper-harvester (version v2.0.1). Zenodo. https://doi.org/10.5281/zenodo.10040467