Using the Trove Newspaper Harvester on the command line¶

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

Some tips:

Code cells have boxes around them.
To run a code cell click on the cell and then hit Shift+Enter. The Shift+Enter combo will also move you to the next cell, so it's a quick way to work through the notebook.
While a cell is running a * appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.
In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.
To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.

The Trove Newspaper & Gazette Harvester is a command line tool and Python package that helps you download large quantities of digitised articles from Trove's newspapers and gazettes.

If you'd like to install and run the TroveHarvester on your local system see the package documentation.

This notebook demonstrates the basic use of the command line tool.

Getting started¶

Run the cell below to set some things up.

In [ ]:

import os
import shutil

from dotenv import load_dotenv
from IPython.display import HTML, display

load_dotenv()

If you were running TroveHarvester on your local system, you could access the basic help information by entering this on the command line:

troveharvester -h

In this notebook environment you need to start with a ! to run the command-line TroveHarvester script. Click on the cell below and hit Shift+Enter to view the TroveHarvester's basic options.

In [10]:

!troveharvester -h

usage: troveharvester [-h] {start,restart,report} ...

positional arguments:
  {start,restart,report}
    start               start a new harvest
    restart             restart an unfinished harvest
    report              report on a harvest

options:
  -h, --help            show this help message and exit

Before we go any further you should make sure you have a Trove API key. For non-commercial projects, you just fill out a simple form and your API key is generated instantly. Follow the instructions in the Trove Help to obtain your own Trove API Key.

Once you've created a key, you can access it at any time on the 'For developers' tab of your Trove user profile.

Copy your API key now, and paste it in the cell below, between the quotes. Then hit Shift+Enter to save your key as a variable called API_KEY. If a TROVE_API_KEY environment variable is set, its value is used instead.

In [ ]:

# Insert your Trove API key
API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")
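
Before starting a long harvest, it can be worth checking that the key is no longer the placeholder text. The cell below is a minimal sketch. The `key_looks_set` helper is hypothetical and not part of TroveHarvester; it only checks that something has been pasted in, not that Trove will accept the key.

```python
# The placeholder text used in the cell above
PLACEHOLDER = "YOUR API KEY"


def key_looks_set(key):
    """Return True if the key is non-empty and is not the placeholder text."""
    return bool(key) and key.strip() not in ("", PLACEHOLDER)


# In this notebook you would pass API_KEY instead of these sample values
print(key_looks_set("YOUR API KEY"))
print(key_looks_set("abc123"))
```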

What do you want to harvest?¶

The TroveHarvester translates queries from the Trove web interface into something that the API can understand. So all you need to do is construct your query using the web interface. Once you're happy with the results you're getting, just copy the URL from your browser's address bar.

Paste the URL between the quotes in the cell below and hit Shift+Enter to save it as a variable called query.

In [ ]:

query = "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge%201902&l-artType=newspapers&l-state=Queensland&l-title=840"
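
If you're curious about what a search URL contains, you can pull it apart with Python's standard library. This is only a way of inspecting the URL; it is not how the TroveHarvester itself translates queries, and the meaning of each parameter is defined by the Trove web interface.

```python
from urllib.parse import parse_qs, urlsplit

# The same example search URL as in the cell above
query = "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge%201902&l-artType=newspapers&l-state=Queensland&l-title=840"

# Split the URL into its components, then decode the query string
parts = urlsplit(query)
params = parse_qs(parts.query)

# Print each search parameter and its decoded value(s)
for name, values in params.items():
    print(f"{name}: {', '.join(values)}")
```

Note that `%20` in the URL is decoded to a space, so the keyword is shown as `wragge 1902`.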

Running the harvest¶

By default the harvester saves all the article metadata to a CSV-formatted file called results.csv. If you'd like to save the full OCRd text of all the articles, just add the --text option. If you'd like copies of the articles as JPG images, add the --image option. You can also save PDFs of all the articles by adding the --pdf option, but be warned that this will slow down your harvest considerably and can consume large amounts of disk space. So use with care!

Now we're ready to start the harvest! Just run the code in the cell below. You can delete the --text option if you're not interested in saving the full text of every article. You could also try adding --image to save articles as images (this will slow down the harvest).

In [ ]:

!troveharvester start "$query" $API_KEY --text
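
If you'd rather switch the options on and off from Python, the command can be assembled as a list of arguments. This is a rough sketch: `build_harvest_command` is a hypothetical helper, not part of the TroveHarvester package, and it only uses the argument order and the --text, --image, and --pdf options shown above.

```python
import shlex


def build_harvest_command(query, api_key, text=False, image=False, pdf=False):
    """Assemble the argument list for `troveharvester start`."""
    command = ["troveharvester", "start", query, api_key]
    if text:
        command.append("--text")
    if image:
        command.append("--image")
    if pdf:
        command.append("--pdf")
    return command


# Sample values; in this notebook you would pass query and API_KEY
command = build_harvest_command(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge",
    "YOUR API KEY",
    text=True,
)

# Show the command as it would be typed on the command line
print(shlex.join(command))
```

The resulting list could be handed to `subprocess.run(command)` to start the harvest outside a notebook cell.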

You'll know the harvest is finished when the asterisk in the square brackets of the cell above turns into a number.

If the harvest stops before it's finished, you can restart it by running the cell below.

In [ ]:

!troveharvester restart

If you want to check the details of a finished harvest, just run the cell below.

In [ ]:

!troveharvester report

Harvest results¶

You can also view the results of the harvest in the data directory.

See the GLAM Workbench for detailed information about the files included in each harvest.
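
For a quick look at what was harvested, you could count the rows in results.csv. The sketch below assumes that each harvest is saved in its own folder inside the data directory and that results.csv sits directly inside that folder; both function names are hypothetical. It prints whatever column names it finds rather than assuming any.

```python
import csv
from pathlib import Path


def latest_harvest_dir(data_dir="data"):
    """Return the last harvest folder when sorted by name, or None if there are none."""
    root = Path(data_dir)
    if not root.is_dir():
        return None
    folders = sorted(p for p in root.iterdir() if p.is_dir())
    return folders[-1] if folders else None


def summarise_results(harvest_dir):
    """Return (column names, number of rows) for results.csv in a harvest folder."""
    # Assumption: results.csv is stored at the top level of the harvest folder
    csv_path = Path(harvest_dir) / "results.csv"
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        reader = csv.DictReader(csv_file)
        row_count = sum(1 for _ in reader)
        return reader.fieldnames, row_count


harvest = latest_harvest_dir()
if harvest and (harvest / "results.csv").exists():
    columns, count = summarise_results(harvest)
    print(f"{harvest.name}: {count} articles")
    print("Columns:", columns)
else:
    print("No harvest with a results.csv found yet.")
```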

Download your data¶

If you're using this notebook through the MyBinder service (it'll say `mybinder` in the URL), make sure you download your data once the harvest is finished, as it will not be preserved!

Once your harvest is complete, you probably want to download the results. The easiest way to do this is to zip up the results folder. Run the following cell to zip up the folder containing all the data from your most recent harvest.

In [ ]:

# List all the harvest folders in the data directory.
# Sorting by name is used here as a sort by date.
harvests = sorted(
    [d for d in os.listdir("data") if os.path.isdir(os.path.join("data", d))]
)
# Get the most recent (this raises an IndexError if there are no harvests yet)
timestamp = harvests[-1]
# Zip up the folder, creating data/<timestamp>.zip
shutil.make_archive(
    os.path.join("data", timestamp), "zip", os.path.join("data", timestamp)
)

Once your zip file has been created you can find it in the data directory. Or just run the cell below to create a handy download link.

In [ ]:

display(
    HTML(
        f'<a href="data/{timestamp}.zip" download="{timestamp}.zip">data/{timestamp}.zip</a>'
    )
)

Explore your data¶

Have a look at Exploring your TroveHarvest data for some ideas.

Created by Tim Sherratt for the GLAM Workbench.