Trove journals

Current version: v0.1.0

Trove's 'journals' zone includes journals and journal articles, as well as other research outputs and items such as press releases. You can access metadata from the journals zone through the Trove API, but to get text and images you need to do some screen scraping.

Harvesting metadata

Create a list of Trove's digitised journals

Everyone knows about Trove's newspapers, but there is also a growing collection of digitised journals available in the journals zone. They're not easy to find, however, which is why I created the Trove Titles web app. This notebook uses the Trove API to harvest metadata relating to digitised journals – or more accurately, journals that are freely available online in a digital form. This includes some born-digital publications that are available to view in formats like PDF and MOBI, but excludes some digital journals that have access restrictions.

Harvesting text

Get OCRd text from a digitised journal in Trove

Many of the digitised journals available in Trove make OCRd text available for download – one text file for each journal issue. However, while there are records for journals and articles in Trove (and available through the API), there are no records for issues. So how do we find them? This notebook shows how to extract issue data from a digitised journal and download OCRd text for each issue.
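As a rough illustration of the 'extract issue data' step, the sketch below pulls unique `nla.obj-` identifiers out of a fragment of HTML. The sample markup and the `extract_issue_ids` helper are hypothetical and are not taken from the notebook; the real browse pages may be structured differently, so treat this as a starting point only.

```python
import re

# Hypothetical fragment of a journal's issue listing; real markup may differ.
SAMPLE_HTML = """
<a href="/nla.obj-111111111">Vol. 1 No. 1</a>
<a href="/nla.obj-222222222">Vol. 1 No. 2</a>
<a href="/nla.obj-111111111">Vol. 1 No. 1 (thumbnail)</a>
"""


def extract_issue_ids(html):
    """Return unique nla.obj identifiers in the order they first appear."""
    matches = re.findall(r"nla\.obj-\d+", html)
    # dict.fromkeys keeps first-seen order while dropping repeats
    return list(dict.fromkeys(matches))


issue_ids = extract_issue_ids(SAMPLE_HTML)
print(issue_ids)
```

Each identifier could then be used to request that issue's text file, following whatever download URL pattern the notebook documents.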

Download the OCRd text for ALL the digitised journals in Trove!

Using the code and data from the previous two notebooks, you can download the OCRd text from every digitised journal. If you're going to try this, you'll need lots of patience and lots of disk space. Needless to say, don't try this on a cloud service like Binder. Fortunately you don't have to do it yourself, as I've already run the harvest and made all the text files available. See below for details. I repeat, you probably don't want to do this yourself. The point of this notebook is really to document the methodology used to create the repository.

Harvest parliament press releases from Trove

Trove includes more than 380,000 press releases, speeches, and interview transcripts issued by Australian federal politicians and saved by the Parliamentary Library. You can view them all in Trove by searching for nuc:"APAR:PR" in the books & libraries category. This notebook shows you how to harvest both metadata and full text from a search of the parliamentary press releases. The metadata is available from Trove, but to get the full text we have to go back to the Parliamentary Library's database, ParlInfo.
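To make the metadata side of this more concrete, here is a minimal sketch that builds a search URL for the `nuc:"APAR:PR"` query. It assumes the version 2 Trove API endpoint and its `key`, `q`, `zone`, `encoding`, `n` and `s` parameters; the `zone` value and the `build_search_url` helper are assumptions for illustration, so check the current API documentation before relying on them. No request is sent.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed endpoint (Trove API version 2); newer API versions differ.
API_URL = "https://api.trove.nla.gov.au/v2/result"


def build_search_url(api_key, query='nuc:"APAR:PR"', zone="article", start="*", n=100):
    """Build (but do not send) a search request URL.

    The zone value is an assumption; use whichever zone or category
    your version of the API expects for these records.
    """
    params = {
        "key": api_key,
        "q": query,
        "zone": zone,
        "encoding": "json",
        "s": start,  # paging cursor: pass the value returned by the previous page
        "n": n,      # results per page
    }
    return f"{API_URL}?{urlencode(params)}"


url = build_search_url("YOUR_API_KEY")
print(url)
```

Paging through the full result set would mean repeating the request with the `s` value returned by each response until no more results come back.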

Harvesting images

Get covers (or any other pages) from a digitised journal in Trove

In another notebook, I showed how to get issue metadata and OCRd texts from a digitised journal in Trove. It's also possible to download page images and PDFs. This notebook shows how to download all the cover images from a specified journal. With some minor modifications you could download any page, or range of pages.

Finding editorial cartoons in the Bulletin

In another notebook I showed how you could download all the front pages of The Bulletin (and other journals) as images. Amongst the front pages you'll find a number of full page editorial cartoons under The Bulletin's masthead. But you'll also find that many of the front pages are advertising wrap arounds. The full page editorial cartoons were a consistent feature of The Bulletin for many decades, but they moved around between pages one and eleven. That makes them hard to find. I wanted to try and assemble a collection of all the editorial cartoons, but how?

Exploring harvested text

Topic Modelling of Australian Parliamentary Press Releases by Adel Rahmani

This notebook explores the Politicians talking about 'immigrants' and 'refugees' collection of press releases (see below). Adel notes: 'I was curious about the contents of the press releases, however, at more than 12,000 documents the collection is too overwhelming to read through, so I thought I'd get the computer to do it for me, and use topic modelling to poke around the corpus.'

Datasets

CSV formatted list of journals available from Trove in digital form

Harvested: 5 August 2021

This file provides metadata for 7,269 periodicals that are available from Trove's journals zone in digital form. This includes both 'digitised' periodicals and born-digital periodicals submitted through Electronic Legal Deposit. Note that this list contains 7,318 records, because there are some duplicates where multiple Trove work records point to the same digitised periodical. The duplicates have been left in as they include different metadata, and can be easily removed with Pandas.
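The de-duplication idea can be sketched as follows. To stay dependency-free this version uses only the standard library; with Pandas, `DataFrame.drop_duplicates(subset=[...])` does the same job in one call. The sample rows and the column names (`title`, `trove_id`, `work_url`) are placeholders, not the real header of the CSV file.

```python
import csv
import io

# Hypothetical sample; column names are placeholders, not the real CSV header.
SAMPLE_CSV = """title,trove_id,work_url
Journal A,nla.obj-1,https://example.org/work/1
Journal A (alt record),nla.obj-1,https://example.org/work/2
Journal B,nla.obj-2,https://example.org/work/3
"""


def drop_duplicate_rows(rows, key):
    """Keep the first row seen for each value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
unique_rows = drop_duplicate_rows(rows, "trove_id")
print(len(rows), len(unique_rows))
```

Keeping the first record is an arbitrary choice here; since the duplicates carry different metadata, you may prefer to merge them instead.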

CSV formatted list of journals with OCRd text

Harvested: 5 August 2021

This file provides metadata for 1,163 digitised periodicals in Trove that have OCRd text available for download.

OCRd text from Trove digitised journals

Harvested: 5 August 2021

Using the notebooks above I harvested metadata and OCRd text from Trove's digitised periodicals.

  • 1,163 periodicals had OCRd text available for download
  • OCRd text was downloaded from 51,928 periodical issues
  • About 10GB of text was downloaded

Government publications in digital form

Harvested: 5 August 2021

This dataset combines records from the separate harvests of books and periodicals available in digital form that have the type 'Government publication'.

Editorial cartoons from The Bulletin, 1886 to 1952

Harvested: 9 May 2019

Using the notebook above I downloaded at least one full page editorial cartoon for every issue of The Bulletin from 4 September 1886 to 17 September 1952. In total there are 3,471 images (approximately 60GB).

Politicians talking about 'immigrants' and 'refugees'

Using the notebook above I harvested parliamentary press releases that included any of the terms 'immigrant', 'asylum seeker', 'boat people', 'illegal arrivals', or 'boat arrivals'. A total of 12,619 text files were harvested.
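One way to express an 'any of these terms' search is to join quoted phrases with OR and combine them with the `nuc:"APAR:PR"` filter. The sketch below shows that composition; the `build_query` helper is hypothetical, and the exact query string used for the harvest is not given here, so this form is an assumption.

```python
def build_query(terms, nuc="APAR:PR"):
    """Combine a NUC filter with a quoted, OR-joined list of search terms."""
    quoted = " OR ".join(f'"{term}"' for term in terms)
    return f'nuc:"{nuc}" AND ({quoted})'


terms = ["immigrant", "asylum seeker", "boat people", "illegal arrivals", "boat arrivals"]
query = build_query(terms)
print(query)
```

Swapping in a different list of terms, such as 'COVID' and 'coronavirus', would produce the query for a different themed harvest.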

Politicians talking about COVID

Using the notebook above I harvested parliamentary press releases that included any of the terms 'COVID' or 'coronavirus'. A total of 3,995 text files were harvested.

Cite as

Sherratt, Tim. (2019). GLAM-Workbench/trove-journals (version v0.1.0). Zenodo. https://doi.org/10.5281/zenodo.3545216
