Download issues of a periodical as PDFs
This notebook helps you download the issues of a digitised periodical as PDFs. You can download all digitised issues, or specify a range of years to include.
There are three main steps:
- get a list of all the nla.obj identifiers of the periodical's issues
- get the number of pages in each issue
- construct a url to download each issue as a PDF using the nla.obj identifier and the number of pages
Depending on the periodical, this could take many hours to complete and consume a lot of disk space.
# Let's import the libraries we need.
import json
import re
import time
from datetime import timedelta
from pathlib import Path
import arrow
import pandas as pd
import requests_cache
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm
s = requests_cache.CachedSession(expire_after=timedelta(days=30))
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))
Set your parameters
Edit the cell below to insert the nla.obj identifier of the periodical. This identifier will point to the top-level collection page of a periodical in Trove's digitised object viewer. For example, the url of the top-level page of The Bulletin is https://nla.gov.au/nla.obj-68375465, so the identifier is nla.obj-68375465. If you're viewing a digitised issue or page within a periodical, you can use the breadcrumbs link to navigate up to the top-level page.
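If you're starting from a full viewer url, you can pull the identifier out with a simple regular expression (the re module is imported in the first cell). This is just a convenience sketch rather than part of the notebook's workflow; the url below is the Bulletin example mentioned above, and the identifier variable name is only illustrative.
# Extract the nla.obj identifier from a viewer url
url = "https://nla.gov.au/nla.obj-68375465"
identifier = re.search(r"nla\.obj-\d+", url).group(0)
print(identifier)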
Finding digitised periodicals in Trove is not always easy – the Trove Data Guide provides some hints. You can also search this database of digitised periodical titles.
By default, this notebook will download all the issues of a periodical. If you only want issues from a particular range of years, set the start_year and/or end_year values in the cell below. For example, setting start_year = 1900 and end_year = 1940 will download all issues published between 1900 and 1940.
Once you've made your changes to the cell below, select 'Run > Run All Cells' from the menu.
# Insert the periodical's nla.obj identifier
periodical_id = "nla.obj-8423556"
# Optionally set a range of years
start_year = None
end_year = None
Get issue identifiers
Version 3 of the Trove API added a new endpoint to provide information about periodical titles and issues. However, the issues data provided by the API is incomplete. A more reliable alternative is to scrape the list of issues from the browse window in the digitised object viewer – see HOW TO: Get a list of items from a digitised collection in the Trove Data Guide.
def get_issue_ids(periodical_id):
    # The initial startIdx value
    start = 0
    # Number of results per page, used to increment the startIdx value
    n = 20
    items = []
    with tqdm() as pbar:
        # If there aren't 20 results on the page then we've reached the end, so continue harvesting until that happens.
        while n == 20:
            url = f"https://nla.gov.au/{periodical_id}/browse?startIdx={start}&rows=20&op=c"
            # Get the browse page
            response = s.get(url)
            # BeautifulSoup turns the HTML into an easily navigable structure
            soup = BeautifulSoup(response.text, "html.parser")
            # Find all the divs containing issue details and loop through them
            details = soup.find_all(class_="l-item-info")
            for detail in details:
                # Look for the a tag with class "obj-reference content"
                item_id = detail.find(
                    lambda tag: tag.name == "a"
                    and tag.get("class") == ["obj-reference", "content"]
                )["href"].strip("/")
                # Save the issue id
                items.append(item_id)
            if not response.from_cache:
                time.sleep(0.2)
            # Increment the startIdx
            start += n
            # Set n to the number of results on the current page
            n = len(details)
            pbar.update(n)
    return items
issue_ids = get_issue_ids(periodical_id)
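As a quick check, you can see how many issue identifiers were harvested and preview the first few. This uses only the issue_ids list created above.
# How many issues were found?
print(f"Found {len(issue_ids)} issues")
# Preview the first five identifiers
issue_ids[:5]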
Get number of pages in each issue
It's possible to scrape the number of pages along with the identifiers in the previous step. However, I'm not certain that the information is displayed consistently across all periodicals. To play it safe, I'm extracting embedded metadata from the digitised object viewer and getting the number of pages, issue dates, and publication details (if available). See HOW TO: Extract additional metadata from the digitised resource viewer in the Trove Data Guide.
def get_metadata(id):
    """
    Extract work data in a JSON string from the work's HTML page.
    """
    if not id.startswith("http"):
        id = "https://nla.gov.au/" + id
    response = s.get(id)
    try:
        work_data = re.search(
            r"var work = JSON\.parse\(JSON\.stringify\((\{.*\})", response.text
        ).group(1)
    except AttributeError:
        work_data = "{}"
    if not response.from_cache:
        time.sleep(0.2)
    return json.loads(work_data)
def get_issue_data(issue_ids):
    issues = []
    for issue_id in tqdm(issue_ids):
        metadata = get_metadata(issue_id)
        date = metadata.get("issueDate", "")
        try:
            iso_date = arrow.get(date, "ddd, D MMM YYYY").format("YYYY-MM-DD")
        except arrow.parser.ParserMatchError:
            iso_date = ""
        issue = {
            "id": issue_id,
            "date": date,
            "iso_date": iso_date,
            "details": metadata.get("subUnitNo", ""),
            # Use .get() with defaults so issues with no extractable metadata don't raise a KeyError
            "pages": len(metadata.get("children", {}).get("page", [])),
        }
        issues.append(issue)
    return issues
issues = get_issue_data(issue_ids)
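Before downloading anything, it can be useful to preview the harvested issue metadata, for example by loading it into a dataframe (pandas is already imported above).
# Preview the harvested issue metadata
pd.DataFrame(issues).head()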
Download PDFs
Once we have the identifier and number of pages we can construct a url to download each issue. See: HOW TO: Get text, images, and PDFs using Trove’s download link in the Trove Data Guide.
The downloaded PDFs will be saved in the pdfs directory, within a subdirectory named using the periodical's nla.obj identifier.
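If you want to check the download link before committing to a full harvest, here's a minimal sketch that downloads just the first issue in the list using the same url pattern described above. It assumes the issues list and the s session from the cells above; the test filename is only an example.
# Download a single issue as a test
issue = issues[0]
test_url = f"https://nla.gov.au/{issue['id']}/download?downloadOption=pdf&firstPage=0&lastPage={issue['pages'] - 1}"
response = s.get(test_url)
Path(f"test-{issue['id']}.pdf").write_bytes(response.content)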
def filter_issues(issues, start_year=None, end_year=None):
    filtered = []
    if not (start_year or end_year):
        return issues
    for issue in issues:
        year = issue["iso_date"][:4]
        if year:
            year = int(year)
            if start_year and end_year:
                if year >= start_year and year <= end_year:
                    filtered.append(issue)
            elif start_year:
                if year >= start_year:
                    filtered.append(issue)
            elif end_year:
                if year <= end_year:
                    filtered.append(issue)
    return filtered
def download_pdfs(issues, start_year=None, end_year=None):
    output_dir = Path("pdfs", periodical_id)
    output_dir.mkdir(exist_ok=True, parents=True)
    for issue in tqdm(filter_issues(issues, start_year, end_year)):
        pdf_url = f"https://nla.gov.au/{issue['id']}/download?downloadOption=pdf&firstPage=0&lastPage={issue['pages'] - 1}"
        response = s.get(pdf_url, stream=True)
        # Use the issue date in the filename if there is one
        if issue["iso_date"]:
            filename = f"{issue['iso_date']}-{issue['id']}.pdf"
        else:
            filename = f"{issue['id']}.pdf"
        Path(output_dir, filename).write_bytes(response.content)
        if not response.from_cache:
            time.sleep(1)
download_pdfs(issues, start_year, end_year)
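Once the downloads have finished, you can list the saved files to check that everything arrived as expected. This only uses the pdfs output directory created above.
# List the downloaded PDFs
pdfs = sorted(Path("pdfs", periodical_id).glob("*.pdf"))
print(f"{len(pdfs)} PDFs downloaded")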
Save metadata
You can also save the harvested issue metadata as a CSV file. It will be saved in the same directory as the PDFs.
name_parts = [str(p) for p in [periodical_id, "issues", start_year, end_year] if p]
csv_filename = f"{'-'.join(name_parts)}.csv"
df = pd.DataFrame(issues)
df.to_csv(Path("pdfs", periodical_id, csv_filename), index=False)
Created by Tim Sherratt for the GLAM Workbench.