Using the Trove Harvester as a Python package¶

This notebook uses the trove-newspaper-harvester Python package to harvest the complete results of a search in Trove's digitised newspapers or gazettes. The default settings will save both the article metadata and all of the OCRd text.

If you want to run your own harvest:

copy and paste your Trove API key where indicated below
construct your search query in Trove, then copy and paste the query URL where indicated below
adjust harvest options if desired
from the 'Run' menu select 'Run All Cells'

Once the harvest has finished, a download link will be displayed. You can also view the results of the harvest in the data directory. The GLAM Workbench includes detailed information about the files included in each harvest.

In [ ]:

import os
import shutil
from pathlib import Path

from dotenv import load_dotenv
from IPython.display import HTML, display
from trove_newspaper_harvester.core import Harvester, get_harvest, prepare_query

load_dotenv()

Set your Trove API key¶

You need to have a Trove API key to use the harvester. For non-commercial projects, you just fill out a simple form and your API key is generated instantly. Follow the instructions in the Trove Help to obtain your own Trove API Key.

Once you've created a key, you can access it at any time on the 'For developers' tab of your Trove user profile.

Copy your API key now, and paste it in the cell below, between the quotes.

In [ ]:

# Insert your Trove API key
API_KEY = "YOUR API KEY"

# Ignore this -- it will get an API key value from environment variables if available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")
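
As a hedged aside, the fallback logic above can be wrapped in a small check that fails early if the placeholder was never replaced. The helper name `resolve_api_key` and the placeholder test are assumptions made for illustration; they are not part of the trove-newspaper-harvester package.

```python
import os

PLACEHOLDER = "YOUR API KEY"  # the placeholder string used in the cell above


def resolve_api_key(default=PLACEHOLDER, env_var="TROVE_API_KEY"):
    """Return the key from the environment if set, otherwise the pasted value.

    Hypothetical helper: raises ValueError if no real key was supplied.
    """
    key = os.getenv(env_var) or default
    if not key or key == PLACEHOLDER:
        raise ValueError(
            f"No Trove API key found: paste one in, or set {env_var}."
        )
    return key
```

Calling `API_KEY = resolve_api_key(API_KEY)` before setting up the harvester would then surface a missing key straight away, rather than part-way through a harvest.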

Set your search query¶

The Trove Harvester translates queries from the Trove web interface into something that the API can understand. So all you need to do is construct your query using the web interface. Once you're happy with the results you're getting, just copy the URL and paste it between the quotes in the cell below.

In [ ]:

query = "https://trove.nla.gov.au/search/category/newspapers?keyword=%22octopus%20intelligence%22"
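
If you want to check what a copied URL actually contains before harvesting, the standard library can decode it. This is only a way of inspecting the URL; it is not a description of how `prepare_query` works internally.

```python
from urllib.parse import parse_qs, urlparse

query = "https://trove.nla.gov.au/search/category/newspapers?keyword=%22octopus%20intelligence%22"

# Split the URL into its parts, then decode the query string into a dict
parsed = urlparse(query)
search_params = parse_qs(parsed.query)

print(parsed.path)    # /search/category/newspapers
print(search_params)  # {'keyword': ['"octopus intelligence"']}
```

The decoded value shows that `%22` and `%20` are just the encoded forms of the quotes and space in the phrase search "octopus intelligence".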

Adjust options if necessary¶

The Newspaper Harvester accepts a number of options, including:

text: save articles as text files, True or False
pdf: save articles as PDFs, True or False
image: save articles as images, True or False
config_file: path to a config file generated by a previous harvest
data_dir: directory for harvests
harvest_dir: directory for this harvest
include_linebreaks: keep linebreaks in text files, True or False

The cell below sets some default values for these options. By default, your harvest will include the OCRd text of all articles and will be saved in the data directory in a sub-directory named using the current date and time. Edit the cell below if you want to change this.

In [ ]:

# What to save
# Setting pdf or image to True will greatly increase the harvest time
text = True  # Include the OCRd text
pdf = False
image = False

# Use an existing config file (from a previous harvest)
# set to the path of the config file
config_file = None

# Where to save the results
# data_dir contains multiple harvests
data_dir = "data"
# harvest_dir contains a single harvest
harvest_dir = None

# Text options
# line breaks are stripped unless this is set to True
include_linebreaks = False
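
If you would rather give a harvest a recognisable name than rely on the default date-and-time sub-directory, one option is to build the name yourself and assign it to `harvest_dir`. The sketch below is only an illustration: the helper `make_harvest_dir_name`, the label, and the timestamp format are all assumptions, not something the package defines.

```python
from datetime import datetime


def make_harvest_dir_name(label, now=None):
    """Build a directory name such as 'octopus-20240131093000' (hypothetical helper)."""
    now = now or datetime.now()
    # Timestamp format chosen for this example; it sorts chronologically as text
    return f"{label}-{now.strftime('%Y%m%d%H%M%S')}"


harvest_dir = make_harvest_dir_name("octopus")
```

Keeping a timestamp in the name means repeated harvests of the same query do not overwrite each other inside the data directory.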

Set up and run the Harvester¶

You shouldn't need to change anything in the cells below.

In [ ]:

params = prepare_query(query=query)
harvester = Harvester(
    query_params=params,
    key=API_KEY,
    data_dir=data_dir,
    harvest_dir=harvest_dir,
    config_file=config_file,
    text=text,
    pdf=pdf,
    image=image,
    include_linebreaks=include_linebreaks,
)

If the harvest stops before it has finished (for example, if Trove goes down or your internet connection fails), click on the cell below and select 'Run Selected Cell and All Below' from the 'Run' menu. This will pick the harvest up from where it stopped.

In [ ]:

harvester.harvest()
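
For unattended runs, re-running the cell by hand could in principle be replaced by a small retry loop. The sketch below is a generic pattern, not part of the harvester: `run_with_retries` is a hypothetical helper, and it assumes that an interrupted harvest surfaces as a Python exception and that calling `harvester.harvest()` again resumes the harvest, as re-running the cell does.

```python
import time


def run_with_retries(task, attempts=3, wait_seconds=60):
    """Call task() until it succeeds or the attempts run out (hypothetical helper)."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as error:  # broad on purpose: the expected errors are unknown
            if attempt == attempts:
                raise  # give up and show the last error
            print(f"Attempt {attempt} failed ({error}); retrying in {wait_seconds}s")
            time.sleep(wait_seconds)
```

Under those assumptions it would be used as `run_with_retries(harvester.harvest)`, passing the method itself rather than calling it.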

In [ ]:

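# Get the path to the harvest directory
harvest = get_harvest()
# Save the harvested results as a CSV file
harvester.save_csv()
# Delete the ndjson results file and remove it from the crate
Path(harvest, "results.ndjson").unlink()
harvester.remove_ndjson_from_crate()
# Zip the harvest directory so it can be downloaded
shutil.make_archive(harvest, "zip", harvest)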
display(
    HTML(
        f'<b>Download results</b>: <a href="{str(harvest)}.zip" download>{str(harvest)}.zip</a>'
    )
)
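
Before downloading, a quick sanity check on the CSV can confirm that the harvest produced what you expected. This is a minimal sketch using only the standard library; `summarise_csv` is a hypothetical helper, and the file name `results.csv` in the usage note is an assumption based on the `results.ndjson` name used above.

```python
import csv
from pathlib import Path


def summarise_csv(csv_path):
    """Return (row_count, column_names) for a CSV file with a header row."""
    with Path(csv_path).open(newline="", encoding="utf-8") as csv_file:
        reader = csv.DictReader(csv_file)
        row_count = sum(1 for _ in reader)
        # fieldnames is populated once the header row has been read
        return row_count, reader.fieldnames or []
```

Assuming the CSV is saved as `results.csv` in the harvest directory, `summarise_csv(Path(harvest, "results.csv"))` would return the number of harvested articles and the list of metadata columns.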

Next steps¶

for some tools or questions you'll probably want to create slices or filtered subsets of the complete harvest – see Reshaping your newspaper harvest for some examples
start exploring your results by examining the metadata and the OCRd text (these notebooks are incomplete, but should give you an idea of the possibilities)

Created by Tim Sherratt for the GLAM Workbench.