Harvest Pandora subjects and collections¶
This notebook harvests Pandora's navigation hierarchy, saving the connections between subjects, collections, and titles.
The Pandora selective web archive assigns archived titles to subject and collection groupings. These curated collections help researchers find archived websites relating to specific topics or events, such as election campaigns. This notebook creates two datasets containing details of all Pandora's subjects and collections. The datasets can be used to assemble subject-based collections of archived websites for research.
Pandora vs Trove¶
The relationship between Pandora and Trove is a bit confusing. While the websites archived in Pandora are now part of the Australian Web Archive, and are searchable through Trove, not all of Pandora's metadata can be accessed through the Trove web interface.
Trove's Categories tab includes a link to Archived Webpage Collections. This collection hierarchy is basically the same as Pandora's – combining Pandora's subjects, sub-categories, and collections into a single structure. However, it only includes links to titles that are part of collections. This is important, as fewer than half of Pandora's selected titles seem to be assigned to collections.
I originally started harvesting the collections from Trove, but eventually realised that I was missing out on titles that had been grouped by subject, but were not part of collections. As a result, I shifted approaches to scrape the data from Pandora directly.
Subjects, Collections, and Titles¶
There are two levels of subject headings in Pandora. The top-level headings are displayed on the Pandora home page, for example, Arts and Politics. The top-level headings can include sub-categories. For example, 'Arts' includes sub-categories for 'Architecture' and 'Dance'. Both the top-level subjects and sub-categories can include collections and titles.
Collections are more fine-grained groupings of titles, often related to specific events or activities. Collections can include sub-collections. In Pandora's web interface, the sub-collections are displayed as sub-headings on the collection page, but in the backend each sub-collection has its own identifier. For example, the 'Galleries' collection includes a list of gallery websites divided into sub-collections by the state in which they're located. Both collections and sub-collections can contain titles.
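Putting this together, the hierarchy looks something like this (a simplified sketch – titles can be attached at any level):

Subject (e.g. Arts)
├── titles
├── collections (e.g. Galleries)
│   ├── titles
│   └── sub-collections → titles
└── sub-categories (e.g. Dance)
    ├── titles
    └── collections → sub-collections → titles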
Collections can appear in multiple subjects and sub-categories. This means the harvesting process saves duplicate copies of some collections; these duplicates are removed at the end of the notebook.
Titles are also a type of group, bringing together webpage snapshots captured over time. They can also link urls where the addresses or domains of resources have changed, so each title can be associated with multiple urls. This notebook doesn't harvest the full title details; it simply links title identifiers with subjects and collections. See Harvest the full collection of Pandora titles for more.
Titles can be linked at any level in this hierarchy. So to assemble a complete list of titles under a subject such as 'Arts', you need to get all the titles from 'Arts', all of the titles from all of the sub-categories under 'Arts', and all of the titles from all of the collections and sub-collections under both 'Arts' and its sub-categories. See Create archived url datasets from Pandora's collections and subjects for an example of this, or the sketch below.
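Here's a minimal sketch of that traversal, using the two ndjson datasets this notebook creates (described below). The load_ndjson() and collect_titles() helpers, and the example subject id, are illustrative assumptions – they're not part of this notebook.

import json

def load_ndjson(path):
    # Load an ndjson file into a dict keyed by record id
    with open(path) as f:
        return {r["id"]: r for r in (json.loads(line) for line in f)}

subjects = load_ndjson("pandora-subjects.ndjson")
collections = load_ndjson("pandora-collections.ndjson")

def collect_titles(subject_id):
    # Gather title ids from a subject, its sub-categories,
    # and all of their collections and sub-collections
    subject = subjects[subject_id]
    titles = list(subject["titles"])
    for col_id in subject.get("collections", []):
        collection = collections[col_id]
        titles.extend(collection["titles"])
        for sub_id in collection.get("subcollections", []):
            titles.extend(collections[sub_id]["titles"])
    for subcat_id in subject.get("subcategories", []):
        titles.extend(collect_titles(subcat_id))
    # Remove duplicates while preserving order
    return list(dict.fromkeys(titles))

# e.g. collect_titles("/subject/2") -- the id here is illustrative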
For more on Pandora's approach to describing collections see Creating Structure in Web Archives With Collections: Different Concepts From Web Archivists.
Datasets¶
This notebook creates two datasets in ndjson format (one JSON object per line):

- pandora-subjects.ndjson
- pandora-collections.ndjson
The pandora-subjects.ndjson file includes the following fields:

- name – subject heading
- id – subject identifier in the form /subject/[number]
- subcategories – list of sub-category identifiers
- collections – list of collection identifiers
- titles – list of title identifiers
The pandora-collections.ndjson file includes the following fields:

- name – collection/sub-collection name
- id – collection identifier in the form /col/[number]
- subcollections – list of sub-collection identifiers
- titles – list of title identifiers
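For example, a subject record and a collection record look something like this (one JSON object per line; the identifiers and values here are illustrative, not real Pandora data):

{"name": "Arts", "id": "/subject/2", "subcategories": ["/subject/52"], "collections": ["/col/123"], "titles": ["/tep/10001"]}
{"name": "Galleries - Victoria", "id": "/col/456", "subcollections": [], "titles": ["/tep/10002"]}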
Pre-harvested versions of these datasets are available from the Pandora collections data section of the GLAM Workbench.
import json
import os
import re
import time
from pathlib import Path
import pandas as pd
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from tqdm.auto import tqdm
load_dotenv()
class SubjectHarvester:
    def __init__(
        self,
        subject_output="pandora-subjects.ndjson",
        collection_output="pandora-collections.ndjson",
        sample=None,
    ):
        self.subject_output = subject_output
        self.collection_output = collection_output
        self.sample = sample
    def get_title_ids(self, page_id):
        """
        Get the TEP identifiers for all the titles on the specified page.
        Excludes titles in sub-collections as they will be harvested separately.
        """
        title_ids = []
        page = 1
        # Subjects can have multiple pages of titles, so we'll go through page by page
        # until there are no more titles
        while page:
            response = requests.get(f"http://pandora.nla.gov.au{page_id}/{page}")
            soup = BeautifulSoup(response.text, "lxml")
            # We only want itemlists without a preceding h1 heading --
            # an h1 before an itemlist indicates it's actually a sub-collection,
            # which will be harvested separately
            title_links = []
            for item_list in soup.find_all("div", class_="itemlist"):
                if not item_list.find_previous_sibling("h1"):
                    # Extract the TEP ids from the links
                    title_links = item_list.find_all("a", href=re.compile(r"/tep/\d+"))
                    for title_link in title_links:
                        title_ids.append(title_link["href"])
            # Continue if it's a subject page and there were title links on this page
            if title_links and "/col/" not in page_id:
                page += 1
            else:
                page = None
            time.sleep(0.5)
        return title_ids
    def harvest_subcategories(self, subject_id):
        """
        Harvest details of sub-categories from a subject page.
        """
        subject_ids = []
        # Get the subject page
        response = requests.get(f"http://pandora.nla.gov.au{subject_id}")
        soup = BeautifulSoup(response.text, "lxml")
        # Get all the links to sub-categories
        subject_links = soup.find_all("a", href=re.compile(r"/subject/\d+$"))
        # Process all the sub-categories
        for subject_link in subject_links:
            subject_name = " ".join(subject_link.stripped_strings)
            subject_id = subject_link["href"]
            # Get collections
            collection_ids = self.harvest_collections(subject_id)
            # Get titles
            title_ids = self.get_title_ids(subject_id)
            with Path(self.subject_output).open("a") as subjects_file:
                subjects_file.write(
                    json.dumps(
                        {
                            "name": subject_name,
                            "id": subject_id,
                            "collections": collection_ids,
                            "titles": title_ids,
                        }
                    )
                    + "\n"
                )
            subject_ids.append(subject_id)
        return subject_ids
    def harvest_subcollections(self, coll_id, coll_name):
        """
        Harvest sub-collections from a collection page.
        """
        collection_ids = []
        # Get the collection page
        response = requests.get(f"http://pandora.nla.gov.au{coll_id}")
        soup = BeautifulSoup(response.text, "lxml")
        # Sub-collections are included in the collection pages and identified with h1 headings.
        # The h1 headings include a name attribute that is set to the sub-collection id.
        # You can use the id to request a page that just has the sub-collection.
        # First get all the h1 tags
        for subc in soup.find_all("h1"):
            # Get the id value from the name attribute
            sub_link = subc.find("a", {"name": re.compile(r"\d+")})
            if sub_link:
                sub_name = sub_link.string
                # Add the collection name to the sub-collection name (if it's not already there)
                if coll_name not in sub_name:
                    sub_name = f"{coll_name} - {sub_name}"
                # Use the sub-collection id to get a list of titles in the sub-collection
                sub_id = f"/col/{sub_link['name']}"
                title_ids = self.get_title_ids(sub_id)
                with Path(self.collection_output).open("a") as collections_file:
                    collections_file.write(
                        json.dumps(
                            {
                                "name": sub_name,
                                "id": sub_id,
                                "titles": title_ids,
                                "subcollections": [],
                            }
                        )
                        + "\n"
                    )
                collection_ids.append(sub_id)
        return collection_ids
    def harvest_collections(self, subject_id):
        """
        Harvest details of collections from a subject or sub-category page.
        """
        collection_ids = []
        # Get the subject page
        response = requests.get(f"http://pandora.nla.gov.au{subject_id}")
        soup = BeautifulSoup(response.text, "lxml")
        # Get all of the links to collection pages
        collection_links = soup.find_all("a", href=re.compile(r"/col/\d+$"))
        # Process each collection page
        for coll_link in collection_links:
            coll_name = " ".join(coll_link.stripped_strings)
            coll_id = coll_link["href"]
            # Get any sub-collections
            subcollection_ids = self.harvest_subcollections(coll_id, coll_name)
            # Get titles
            title_ids = self.get_title_ids(coll_id)
            with Path(self.collection_output).open("a") as collections_file:
                collections_file.write(
                    json.dumps(
                        {
                            "name": coll_name,
                            "id": coll_id,
                            "subcollections": subcollection_ids,
                            "titles": title_ids,
                        }
                    )
                    + "\n"
                )
            collection_ids.append(coll_id)
        return collection_ids
    def harvest(self):
        """
        Start the harvest by getting the top-level subjects on the Pandora home page,
        then work down the hierarchy from there.
        """
        # Remove old data files
        Path(self.subject_output).unlink(missing_ok=True)
        Path(self.collection_output).unlink(missing_ok=True)
        # Get the Pandora home page
        response = requests.get("http://pandora.nla.gov.au/")
        soup = BeautifulSoup(response.text, "lxml")
        # Find the list of subjects
        subject_list = soup.find("div", class_="browseSubjects").find_all("li")
        # Process each top-level subject
        for subject in tqdm(subject_list[: self.sample]):
            subject_link = subject.find("a")
            subject_name = " ".join(subject_link.stripped_strings)
            subject_id = subject_link["href"]
            # Get sub-categories
            subcategory_ids = self.harvest_subcategories(subject_id)
            # Get collections
            collection_ids = self.harvest_collections(subject_id)
            # Get titles
            title_ids = self.get_title_ids(subject_id)
            with Path(self.subject_output).open("a") as subjects_file:
                subjects_file.write(
                    json.dumps(
                        {
                            "name": subject_name,
                            "id": subject_id,
                            "subcategories": subcategory_ids,
                            "collections": collection_ids,
                            "titles": title_ids,
                        }
                    )
                    + "\n"
                )
harvester = SubjectHarvester()
harvester.harvest()
Remove duplicate collections¶
Collections can appear under multiple subjects, so there will be duplicates in the collections dataset.
# Load the harvested collections dataset
dfc = pd.read_json("pandora-collections.ndjson", lines=True)
dfc.shape

# Collections harvested under multiple subjects share an id, so drop duplicates
dfc.drop_duplicates(subset=["id"], inplace=True)
dfc.shape

# Save the deduplicated dataset back to the ndjson file
dfc.to_json("pandora-collections.ndjson", orient="records", lines=True)
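As a quick check, you could reload the deduplicated file and see which collections have the most linked titles (a sketch, assuming the harvest above has completed; the num_titles column is added here for illustration):

dfc = pd.read_json("pandora-collections.ndjson", lines=True)
# Count the title ids attached to each collection
dfc["num_titles"] = dfc["titles"].str.len()
dfc.sort_values("num_titles", ascending=False)[["name", "num_titles"]].head()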
# IGNORE CELL -- TESTING ONLY
if os.getenv("GW_STATUS") == "dev":
    harvester = SubjectHarvester(
        subject_output="pandora-subjects-test.ndjson",
        collection_output="pandora-collections-test.ndjson",
        sample=1,
    )
    harvester.harvest()
    Path("pandora-subjects-test.ndjson").unlink(missing_ok=True)
    Path("pandora-collections-test.ndjson").unlink(missing_ok=True)
Created by Tim Sherratt for the GLAM Workbench.