Harvest information about newspaper issues¶
When you search Trove's newspapers, you find articles – these articles are grouped by page, and all the pages from a particular date make up an issue. But how do you find out what issues are available? On what dates were newspapers published? This notebook shows how you can get information about issues from the Trove API.
The code below generates two datasets:
- Total number of issues per year for every newspaper – 27,615 rows with the fields:
  - `title` – newspaper title
  - `title_id` – newspaper id
  - `state` – place of publication
  - `year` – year published
  - `issues` – number of issues
- Complete list of issues for every newspaper – 2,655,664 rows with the fields:
  - `title` – newspaper title
  - `title_id` – newspaper id
  - `state` – place of publication
  - `issue_id` – issue identifier
  - `issue_date` – date of publication (YYYY-MM-DD)
These datasets are harvested regularly; you can find the latest versions here:
- Total number of issues per year for each newspaper in Trove
- Complete list of issues for every newspaper in Trove
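Once you've downloaded one of the harvested CSVs, it's straightforward to load into pandas. Here's a minimal sketch using a couple of made-up rows in place of a real file, just to show the field layout described above and how parsing `issue_date` as a datetime makes filtering by year easy:

```python
import io

import pandas as pd

# Illustrative sample matching the documented fields -- not real harvested data.
# With a downloaded file you'd pass the filename to read_csv instead.
csv_data = io.StringIO(
    "title,title_id,state,issue_id,issue_date\n"
    "Example Gazette,999,NSW,123456,1901-01-01\n"
    "Example Gazette,999,NSW,123457,1901-01-08\n"
)
df = pd.read_csv(csv_data, parse_dates=["issue_date"])

# issue_date is now a datetime column, so you can filter by year
issues_1901 = df.loc[df["issue_date"].dt.year == 1901]
```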
Issue urls¶
To keep the file size down, I haven't included an `issue_url` in the issues dataset, but these are easily generated from the `issue_id`. Just add the `issue_id` to the end of `http://nla.gov.au/nla.news-issue`. For example: http://nla.gov.au/nla.news-issue495426. Note that when you follow an issue url, you actually get redirected to the url of the first page in the issue.
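For example, you could add an `issue_url` column to the dataset like this (the ids below are just for illustration):

```python
import pandas as pd

# Hypothetical issue ids, for illustration only
df = pd.DataFrame({"issue_id": [495426, 495427]})

# Prepend the nla.news-issue base url to each id
df["issue_url"] = "http://nla.gov.au/nla.news-issue" + df["issue_id"].astype(str)
```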
import json
import os
from datetime import timedelta
import altair as alt
import arrow
import pandas as pd
import requests_cache
from dotenv import load_dotenv
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm
# Create a cached session that will automatically retry on server errors
s = requests_cache.CachedSession(expire_after=timedelta(days=30))
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("http://", HTTPAdapter(max_retries=retries))
s.mount("https://", HTTPAdapter(max_retries=retries))
load_dotenv()
True
# Insert your Trove API key
API_KEY = "YOUR API KEY"
# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")
API_URL = "https://api.trove.nla.gov.au/v3/newspaper/title/"
PARAMS = {"encoding": "json"}
HEADERS = {"X-API-KEY": API_KEY}
Total number of issues per year for every newspaper in Trove¶
To get a list of all the newspapers in Trove you make a request to the `newspaper/titles` endpoint. This provides summary information about each title, but no data about issues. To get issue data you have to request information about each title separately, using the `newspaper/title/[title id]` endpoint. If you add `include=years` to the request, you get a list of years in which issues were published, and the total number of issues for each year. We can use this to aggregate information about the number of issues by title and year.
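To make the shape of the response concrete, here's a trimmed, illustrative sample of the JSON returned when you add `include=years`, and how the per-year issue counts can be summed. The field names (`year`, `date`, `issuecount`) match those used in the harvesting code below; the values are made up:

```python
# Trimmed, illustrative sample of a newspaper/title/[title id]?include=years
# response -- the values are invented, but the field names match the code below
data = {
    "id": "166",
    "title": "Canberra Community News (ACT : 1925 - 1927)",
    "year": [
        {"date": "1925", "issuecount": 3},
        {"date": "1926", "issuecount": 12},
    ],
}

# Each entry in "year" gives the number of issues published in that year,
# so summing issuecount gives the total issues for the title
total_issues = sum(int(y["issuecount"]) for y in data["year"])
```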
def get_issues_by_year():
    """
    Gets the total number of issues per year for each newspaper.

    Returns:
      * A list of dicts, each containing the number of issues available from a newspaper in a particular year
    """
    years = []
    # First we get a list of all the newspapers (and gazettes) in Trove
    response = s.get(
        "https://api.trove.nla.gov.au/v3/newspaper/titles",
        params=PARAMS,
        headers=HEADERS,
    )
    data = response.json()
    titles = data["newspaper"]
    # Then we loop through all the newspapers to retrieve issue data
    for title in tqdm(titles):
        params = PARAMS.copy()
        # This parameter adds the number of issues per year to the newspaper data
        params["include"] = "years"
        response = s.get(f'{API_URL}{title["id"]}', params=params, headers=HEADERS)
        try:
            data = response.json()
        except json.JSONDecodeError:
            print(response.url)
            print(response.text)
        else:
            # Loop through all the years, saving the totals
            for year in data["year"]:
                years.append(
                    {
                        "title": title["title"],
                        "title_id": title["id"],
                        "state": title["state"],
                        "year": year["date"],
                        "issues": int(year["issuecount"]),
                    }
                )
    return years
issue_totals = get_issues_by_year()
# Save results as a dataframe
df_totals = pd.DataFrame(issue_totals)
df_totals.head()
|  | title | title_id | state | year | issues |
|---|---|---|---|---|---|
| 0 | Canberra Community News (ACT : 1925 - 1927) | 166 | ACT | 1925 | 3 |
| 1 | Canberra Community News (ACT : 1925 - 1927) | 166 | ACT | 1926 | 12 |
| 2 | Canberra Community News (ACT : 1925 - 1927) | 166 | ACT | 1927 | 9 |
| 3 | Canberra Illustrated: A Quarterly Magazine (AC... | 165 | ACT | 1925 | 1 |
| 4 | Federal Capital Pioneer (Canberra, ACT : 1924 ... | 69 | ACT | 1924 | 1 |
How many issues are there?
df_totals["issues"].sum()
np.int64(2739444)
df_totals.shape
(29381, 5)
# Save as a CSV file
df_totals.to_csv(
f'newspaper_issues_totals_by_year_{arrow.now().format("YYYYMMDD")}.csv', index=False
)
Display the total number of issues per year¶
By grouping the number of issues by year, we can see how the number of issues in Trove changes over time. It's interesting to compare this to the number of articles over time.
# Group by year and calculate the sum of the issue totals
df_years = df_totals.groupby(by="year")["issues"].sum().reset_index()
# Create a chart
alt.Chart(df_years).mark_bar(size=2).encode(
x=alt.X("year:Q", axis=alt.Axis(format="c")),
y="issues:Q",
tooltip=["year:O", "issues:Q"],
).properties(width=800)