Harvesting articles that mention "Anzac Day" on Anzac Day¶
The Trove Newspaper Harvester web app and command-line tool make it easy for you to harvest the results of a single search. But if you want to harvest very large or complex searches, you might find it easier to import the trove_newspaper_harvester library directly and take control of the harvesting process.
For example, how would you harvest all of the newspaper articles mentioning "Anzac Day" that were published on Anzac Day, 25 April? It's possible to search for results from a single day using the date index. So, theoretically, you could combine multiple dates using OR and build a very long search query by doing something like this:
days = []
for year in range(1916, 1955):
    days.append(f"date:[{year}-04-24T00:00:00Z TO {year}-04-25T00:00:00Z]")
query_string = f'"anzac day" AND ({" OR ".join(days)})'
However, if you try searching in Trove using the query string generated by this code, it returns no results. Presumably it has hit a limit on query length. But even if you reduce the span of years, you can get some odd results. It seems safer to search for each day independently, but how can you do that without manually creating lots of separate harvests?
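To get a feel for the difference between the two approaches, it helps to measure the queries. The sketch below rebuilds the combined query string, reports its length, and then builds the alternative: one short query per year. The `date_clause` helper is introduced here purely for illustration and isn't part of the harvester, and the yearly queries assume the base search is just the quoted phrase.

```python
def date_clause(year):
    """Return the one-day date-range clause for a single year's Anzac Day."""
    return f"date:[{year}-04-24T00:00:00Z TO {year}-04-25T00:00:00Z]"


years = range(1916, 1955)

# The single combined query: every year's date clause joined with OR
combined_query = f'"anzac day" AND ({" OR ".join(date_clause(y) for y in years)})'

# The alternative: a separate, short query for each year
yearly_queries = {year: f'"anzac day" {date_clause(year)}' for year in years}

print(f"Combined query: {len(combined_query)} characters")
print(f"Longest yearly query: {max(len(q) for q in yearly_queries.values())} characters")
```

Each yearly query stays the same short length no matter how many years you add, whereas the combined query grows with every extra year.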
The example below does the following:
- imports the trove_newspaper_harvester Harvester class and prepare_query function
- uses prepare_query to create the basic set of parameters (without the date search)
- loops through the desired span of years, adding the date search to the query, initialising the Harvester, running the harvest, and saving the results as a CSV file
It also uses the data_dir and harvest_dir parameters of Harvester to tell it where to save the results. These options help you keep related searches together. In this instance, all the searches are saved in the anzac-day parent directory, with each individual search saved in a directory named by the year of the search query. So you end up with one results directory for each year in the span. The separate results files can be easily combined, as shown below.
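The layout described above is predictable enough to compute. As a small sketch, the helper below (a hypothetical name, not part of the harvester) builds the path where each year's CSV is expected to end up, given the data_dir and harvest_dir values used in this example.

```python
from pathlib import Path

DATA_DIR = "anzac-day"


def results_csv_path(year, data_dir=DATA_DIR):
    """Expected location of one year's CSV: <data_dir>/<year>/results.csv"""
    return Path(data_dir, str(year), "results.csv")


# Show the expected CSV location for each year in the demonstration span
for year in range(1916, 1920):
    print(results_csv_path(year))
```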
Set things up¶
import os
from pathlib import Path
import pandas as pd
# importing the trove_newspaper_harvester!
from trove_newspaper_harvester.core import Harvester, prepare_query
%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv
# Insert your Trove API key
API_KEY = "YOUR API KEY"
# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")
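The %dotenv magic only works inside a notebook. If you run this code as a plain Python script, a minimal sketch of the same fallback logic might look like the following. The `get_api_key` and `is_placeholder` helpers are hypothetical names, and the placeholder check is an added convenience rather than anything the harvester requires.

```python
import os

PLACEHOLDER = "YOUR API KEY"


def get_api_key(default=PLACEHOLDER):
    """Prefer TROVE_API_KEY from the environment, else fall back to the default."""
    return os.getenv("TROVE_API_KEY") or default


def is_placeholder(key):
    """True if the key is still the unedited placeholder value."""
    return key == PLACEHOLDER


api_key = get_api_key()
if is_placeholder(api_key):
    print("No Trove API key found: set TROVE_API_KEY or edit the placeholder.")
```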
Run the harvester¶
First of all we use prepare_query to create a base set of parameters. We'll feed it a search for the term "anzac day" and then add in the dates later.
query = 'https://trove.nla.gov.au/search/category/newspapers?keyword="anzac day"'
query_params = prepare_query(query=query)
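The harvesting loop treats the result of prepare_query as a dictionary with a q key, copying it and rewriting q for each year. The sketch below isolates that step using a stand-in dictionary, so it runs without the harvester or an API key. The stand-in's contents and the `add_date_search` helper are assumptions for illustration; the real parameters returned by prepare_query may include other keys.

```python
def add_date_search(base_params, year):
    """Return a copy of base_params with a one-day date search appended to 'q'."""
    params = base_params.copy()  # leave the base parameters untouched
    params["q"] = (
        f"{base_params['q']} date:[{year}-04-24T00:00:00Z TO {year}-04-25T00:00:00Z]"
    )
    return params


# Stand-in for the output of prepare_query (assumed shape, illustration only)
base = {"q": '"anzac day"'}

print(add_date_search(base, 1916)["q"])
print(base["q"])  # unchanged, so it can be reused for the next year
```

Copying matters here: if the date search were appended to the base parameters directly, each year's date clause would pile up on top of the previous ones.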
Next we'll loop through our desired span of years, harvesting the results for each Anzac Day. For demonstration purposes I'll use a short span, harvesting results for the years 1916 to 1919. But you could just as easily harvest results from 1916 to the present.
# Loop through the desired span of years
# Note that the end of the range is not inclusive, so you have to set it to the value above the end you want,
# so this loop will output 1916, 1917, 1918 and 1919, but not 1920.
for year in range(1916, 1920):
    # Copy the base params
    params = query_params.copy()
    # Add the date search to the query string
    params["q"] = (
        f"{query_params['q']} date:[{year}-04-24T00:00:00Z TO {year}-04-25T00:00:00Z]"
    )
    # Initialise the harvester
    # The data_dir parameter sets the parent directory, in this case "anzac-day"
    # The harvest_dir parameter sets the directory, within the parent directory, where the current set of results will be saved,
    # in this case the results directory will be named by the year
    harvester = Harvester(
        query_params=params, key=API_KEY, data_dir="anzac-day", harvest_dir=str(year)
    )
    # Harvest the results
    harvester.harvest()
    # Convert the JSON results to CSV
    harvester.save_csv()
The result of this code will be a series of directories and files like this:
- anzac-day
  - 1916
    - results.csv
    - ro-crate-metadata.json
    - harvester_config.json
    - results.ndjson
  - 1917
    - results.csv
    - ro-crate-metadata.json
    - harvester_config.json
    - results.ndjson
  - 1918
    - results.csv
    - ro-crate-metadata.json
    - harvester_config.json
    - results.ndjson
  - 1919
    - results.csv
    - ro-crate-metadata.json
    - harvester_config.json
    - results.ndjson
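Because each year lands in its own directory, a long run of harvests can be picked up again after an interruption by skipping the years that already have a results.csv. The sketch below shows one way to work out which years remain. `years_to_harvest` is a hypothetical helper, and it assumes that the presence of results.csv means that year's harvest finished, which seems reasonable for the loop above because save_csv() is only called once harvest() has returned.

```python
from pathlib import Path


def years_to_harvest(start, end, data_dir="anzac-day"):
    """List the years in range(start, end) that have no results.csv yet."""
    return [
        year
        for year in range(start, end)
        if not Path(data_dir, str(year), "results.csv").exists()
    ]


print(years_to_harvest(1916, 1920))
```

You could then loop over the returned list instead of the full range when restarting.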
Combine results¶
After harvesting the data above, the results for each year will be in a separate directory. If you want to join the result sets together, you can do something like this to create a single dataframe.
# A list to hold all the dataframes
dfs = []
# Loop through the span of years
for year in range(1916, 1920):
    # Convert the results CSV file to a dataframe and add to the list of dfs
    dfs.append(pd.read_csv(Path("anzac-day", str(year), "results.csv")))
# Combine the dataframes into one
df = pd.concat(dfs)
# View a sample
df.head()
| | article_id | title | date | page | newspaper_id | newspaper_title | category | words | illustrated | edition | ... | snippet | relevance | corrections | last_corrected | tags | comments | lists | text | images | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 102545637 | CHURCH OF ENGLAND. ANZAC DAY. | 1916-04-25 | 2 | 348 | Port Pirie Recorder and North Western Mail (SA... | Article | 52 | N | NaN | ... | There will be a Celebration of Holy Communion ... | 217.144120 | 0 | NaN | 0 | 0 | 0 | NaN | NaN | NaN |
1 | 102545649 | Advertising | 1916-04-25 | 2 | 348 | Port Pirie Recorder and North Western Mail (SA... | Advertising | 823 | N | NaN | ... | NaN | 0.871490 | 0 | NaN | 0 | 0 | 0 | NaN | NaN | NaN |
2 | 102545658 | ANZAC DAY AT PORT PIRIE. | 1916-04-25 | 2 | 348 | Port Pirie Recorder and North Western Mail (SA... | Article | 66 | N | NaN | ... | Anzac Day will be officially commemorated in t... | 217.742200 | 0 | NaN | 0 | 0 | 0 | NaN | NaN | NaN |
3 | 1040069 | Classified Advertising | 1916-04-25 | 2 | 10 | The Mercury (Hobart, Tas. : 1860 - 1954) | Advertising | 3816 | N | NaN | ... | NaN | 0.463448 | 0 | NaN | 0 | 0 | 0 | NaN | NaN | NaN |
4 | 1040072 | ANZAC DAY. | 1916-04-25 | 8 | 10 | The Mercury (Hobart, Tas. : 1860 - 1954) | Article | 89 | N | NaN | ... | May I inquire if on Anzac Day in Hobart any se... | 285.045350 | 0 | NaN | 0 | 0 | 0 | NaN | NaN | NaN |
5 rows × 24 columns
To make sure we have the combined results, we can look at the number of articles by each Anzac Day.
df["date"].value_counts()
date
1917-04-25    445
1918-04-25    384
1919-04-25    344
1916-04-25    315
Name: count, dtype: int64
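The same sanity check can be run straight from the CSV files, without building a dataframe first, which may be handy for very long spans of years. This sketch uses only the standard library and assumes the directory layout shown earlier, plus a date column in each results.csv (as in the sample above). `count_articles_by_date` is a hypothetical helper.

```python
import csv
from collections import Counter
from pathlib import Path


def count_articles_by_date(data_dir="anzac-day"):
    """Count rows per 'date' value across every <data_dir>/<year>/results.csv."""
    counts = Counter()
    for csv_path in sorted(Path(data_dir).glob("*/results.csv")):
        with csv_path.open(newline="", encoding="utf-8") as csv_file:
            for row in csv.DictReader(csv_file):
                counts[row["date"]] += 1
    return counts


# Print the totals in date order (prints nothing if no harvests exist yet)
for date, total in sorted(count_articles_by_date().items()):
    print(date, total)
```

Because it globs for the year directories rather than looping over a fixed range, it picks up whatever years have been harvested so far.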
Created by Tim Sherratt (@wragge) for the GLAM Workbench.