Harvest the full collection of Pandora titles
This notebook harvests a complete collection of archived web page titles from Pandora, the National Library of Australia's selective web archive.
Pandora has been selecting web sites and online resources for preservation since 1996. It has assembled a collection of more than 80,000 titles, organised into subjects and collections. The archived websites are now part of the Australian Web Archive (AWA), which combines the selected titles with broader domain harvests, and is searchable through Trove. However, Pandora's curated collections offer a useful entry point for researchers trying to find web sites relating to particular topics or events.
By combining the list of titles with data harvested from Pandora's hierarchy of subjects and collections, you can create datasets of archived urls relating to specific topics.
What are titles?
Pandora's 'titles' are not single resources; they're groups of resources. Titles link to snapshots of a web resource captured on different dates (also known as Mementos). They also bring together the different urls or domains that have pointed to the resource over time. This means that each title can be linked to multiple urls. This notebook unpacks the title records to create an entry for each archived url.
Harvesting method
There are two main processes used to harvest the data:
- scraping Pandora's complete list of titles to save the link and name for each title
- requesting a machine-readable version of the Title Entry Page (TEP) for each title and saving all the archived urls grouped within the title
The title links have the form /tep/[TEP number] and lead to a human-readable Title Entry Page in Trove. However, by changing the url, you can get a JSON version of the TEP. For example:
- https://webarchive.nla.gov.au/tep/131444 – goes to TEP web page
- https://webarchive.nla.gov.au/bamboo-service/tep/131444 – returns JSON version of TEP
The JSON data includes a list of instances that point to individual snapshots (or Mementos) of the title. As far as I can tell, the TEPs only include snapshots captured through Pandora's selective archiving processes. Additional snapshots of a resource might have been captured by a domain crawl and included in the Australian Web Archive. A complete list of captures can be retrieved by using the url of the archived resource to request a Timemap.
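For example, here's a minimal, self-contained sketch (not part of the harvest itself) that fetches the JSON version of a TEP and prints the gatheredUrl of each instance, then requests a Timemap for an archived url. The Timemap endpoint shown (web.archive.org.au/awa/timemap/link/) and the example url are assumptions for illustration – check them before relying on them.
import requests

# Fetch the JSON version of a TEP and list the gatheredUrl of each instance
tep = requests.get("https://webarchive.nla.gov.au/bamboo-service/tep/131444").json()
for instance in tep["instances"]:
    print(instance.get("gatheredUrl"))

# Request a Timemap listing all captures of a url in the Australian Web Archive
# (the /awa/timemap/link/ endpoint and example url are assumptions -- adjust as needed)
timemap = requests.get(
    "https://web.archive.org.au/awa/timemap/link/http://www.example.com/"
)
print(timemap.text)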
The harvesting process attempts to extract all the archived urls from the gatheredUrl field in the instance data. However, it seems that when Pandora snapshots are migrated to the AWA, the gatheredUrl value is set to point to the snapshot, rather than to the url of the original resource. The original url is embedded in the snapshot url, so the harvesting process extracts it using regular expressions.
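For example, using a made-up url, a migrated gatheredUrl value might bundle Pandora's identifier and capture date in front of the original address. Here's a rough sketch of the extraction (the same logic as the clean_url function defined below):
import re

# Hypothetical gatheredUrl pointing at a Pandora snapshot rather than the original resource
gathered_url = "/12345/20060101-0000/www.example.com/index.html"

# Strip the Pandora identifier and capture date from the front of the url
match = re.search(r"^/?[A-Z0-9]*/?[A-Za-z0-9-]+/", gathered_url)
if match:
    gathered_url = gathered_url[match.end():]

# Restore the scheme if it's missing
if not gathered_url.startswith("http"):
    gathered_url = f"http://{gathered_url}"

print(gathered_url)  # http://www.example.com/index.html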
The urls extracted from each title record are de-duplicated, and each unique value is saved as a separate row in the resulting dataset. This means there can be multiple records for each title.
Dataset structure
The dataset includes a row for each unique url from each title. The fields are:
- tep_id – the TEP identifier in the form /tep/[TEP NUMBER]
- name – the name of the title
- gathered_url – the url that was archived
- surt – a version of the url (the Sort-friendly URI Reordering Transform) that reverses the order of the domain components to put the top-level domain first, making it easier to group or sort resources by domain (see the example below)
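For example, the surt package used later in this notebook reduces a url and its http/https/www variants to the same SURT string. The outputs shown are indicative, assuming the package's default settings:
from surt import surt

# These variants should all produce the same SURT (output is indicative)
print(surt("http://www.nla.gov.au/about"))  # au,gov,nla)/about
print(surt("https://nla.gov.au/about"))     # au,gov,nla)/about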
A pre-harvested version of this dataset is available from the Pandora titles data repository.
import json
import os
import re
import time
from pathlib import Path
import pandas as pd
import requests
import requests_cache
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from surt import surt
from tqdm.auto import tqdm
# Create a cached session for TEP requests and retry automatically on server errors
s = requests_cache.CachedSession("titles.db")
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
s.mount("https://", HTTPAdapter(max_retries=retries))
s.mount("http://", HTTPAdapter(max_retries=retries))
load_dotenv()
def harvest_titles(output="titles_all.ndjson", sample_only=False):
"""
Scrapes details of all titles from the Pandora website.
"""
Path(output).unlink(missing_ok=True)
page = 1
with tqdm() as pbar:
        # Continue harvesting page by page until there are no results
while page:
# Request a page of title links
response = requests.get(f"http://pandora.nla.gov.au/alpha/ALL/{page}")
soup = BeautifulSoup(response.text, "lxml")
title_links = []
with Path(output).open("a") as titles_file:
# Find all the item lists on the page and loop through them
for item_list in soup.find_all("div", class_="itemlist"):
# Get all the tep links
title_links = item_list.find_all("a", href=re.compile(r"/tep/\d+"))
# Save the tep id and name
for title_link in title_links:
titles_file.write(
json.dumps(
{
"tep_id": title_link["href"],
"name": title_link.string,
}
)
+ "\n"
)
pbar.update(1)
            # If there are title links on this page, increment the page value and continue
            if title_links and not sample_only:
                page += 1
            # If there are no title links, stop harvesting
            else:
                page = None
time.sleep(0.5)
harvest_titles()
Extract archived urls from TEPs
Now we'll request data for each TEP and extract the archived urls.
def clean_url(url):
"""
Get the harvested url from a Pandora snapshot link.
"""
match = re.search(r"^/?[A-Z0-9]*/?[A-Za-z0-9-]+/", url)
if match:
url = url[match.end() :]
if not url.startswith("http"):
url = f"http://{url}"
return url
def add_title_urls(input="titles_all.ndjson", output="title_urls.ndjson"):
    """
    Requests the JSON data for each TEP and saves an entry for each unique archived url.
    """
with Path(input).open("r") as input_file:
with Path(output).open("w") as output_file:
for line in tqdm(input_file):
tep_data = json.loads(line)
# Get TEP JSON
url = (
f"https://webarchive.nla.gov.au/bamboo-service{tep_data['tep_id']}"
)
response = s.get(url)
# Some TEPs produce 500 errors -- seems they're no longer in the archive?
if response.ok:
data = response.json()
instance_urls = []
# Title record includes multiple instances
# An instance can be a different url, or a Pandora snapshot
# We want to get all the distinct urls, so we'll trim the Pandora bits from urls and
# use surts to merge http, https, www addresses
surts = []
for instance in data["instances"]:
# First we'll use the `gatheredUrl` field
if gathered_url := instance.get("gatheredUrl"):
# Remove the Pandora part of the url (if there is one)
gathered_url = clean_url(gathered_url)
try:
tep_surt = surt(gathered_url)
# This is to handle a broken url
except ValueError:
gathered_url = gathered_url.replace(
"http://https:", "http://"
)
tep_surt = surt(gathered_url)
# If there's no `gatheredUrl`, we'll use the `url`
elif tep_url := instance.get("url"):
# Remove Pandora part of link
gathered_url = re.search(
r"http://pandora.nla.gov.au/pan/\w+/\w+-\w+/(.*)",
tep_url,
).group(1)
if not gathered_url.startswith("http"):
gathered_url = f"http://{gathered_url}"
tep_surt = surt(gathered_url)
else:
tep_surt = None
# Add url to list if we don't already have it (check surts)
if tep_surt and tep_surt not in surts:
instance_urls.append(gathered_url)
surts.append(tep_surt)
# Save each url
for instance_url in sorted(set(instance_urls)):
tep_data["gathered_url"] = instance_url
tep_data["surt"] = surt(instance_url)
output_file.write(json.dumps(tep_data) + "\n")
if not response.from_cache:
time.sleep(0.5)
else:
output_file.write(json.dumps(tep_data) + "\n")
add_title_urls()
dft = pd.read_json("title_urls.ndjson", lines=True)
dft.to_csv("pandora-titles.csv", index=False, encoding="utf-8-sig")
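As a quick, optional check of the results, you could use the surt values to group the urls by domain – for example, counting the number of archived urls under each domain. This assumes the dft dataframe created above and the surt field described earlier.
# Count the number of archived urls under each domain (optional sanity check)
dft["domain"] = dft["surt"].str.split(")").str[0]
dft["domain"].value_counts().head(10)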
# IGNORE THIS CELL -- TESTING ONLY
if os.getenv("GW_STATUS") == "dev":
harvest_titles(output="test.ndjson", sample_only=True)
add_title_urls(input="test.ndjson", output="test_urls.ndjson")
Path("test.ndjson").unlink()
Path("test_urls.ndjson").unlink()
Created by Tim Sherratt for the GLAM Workbench.