Harvest Pandora subjects and collections¶

This notebook harvests Pandora's navigation hierarchy, saving the connections between subjects, collections, and titles.

The Pandora selective web archive assigns archived titles to subject and collection groupings. These curated collections help researchers find archived websites relating to specific topics or events, such as election campaigns. This notebook creates two datasets containing details of all Pandora's subjects and collections. The datasets can be used to assemble subject-based collections of archived websites for research.

Pandora vs Trove¶

The relationship between Pandora and Trove is a bit confusing. While the websites archived in Pandora are now part of the Australian Web Archive, and are searchable through Trove, not all of Pandora's metadata can be accessed through the Trove web interface.

Trove's Categories tab includes a link to Archived Webpage Collections. This collection hierarchy is basically the same as Pandora's – combining Pandora's subjects, subcategories, and collections into a single structure. However, it only includes links to titles that are part of collections. This is important, as less than half of Pandora's selected titles seem to be assigned to collections.

I originally started harvesting the collections from Trove, but eventually realised that I was missing out on titles that had been grouped by subject, but were not part of collections. As a result, I shifted approaches to scrape the data from Pandora directly.

Subjects, Collections, and Titles¶

There are two levels of subject headings in Pandora. The top-level headings are displayed on the Pandora home page, for example, Arts and Politics. The top-level headings can include sub-categories. For example, 'Arts' includes sub-categories for 'Architecture' and 'Dance'. Both the top-level subjects and sub-categories can include collections and titles.

Collections are more fine-grained groupings of titles, often related to specific events or activities. Collections can include sub-collections. In Pandora's web interface, the sub-collections are displayed as sub-headings on the collection page, but in the backend each sub-collection has its own identifier. For example, the 'Galleries' collection, includes a list of gallery websites divided into sub-collections by the state in which they're located. Both collections and sub-collections can contain titles.

Collections can appear in multiple subjects and sub-categories. This means that the harvesting process saves duplicate copies of collections that need to be removed.

Titles are also a type of group, bringing together webpage snapshots over time. They can also link urls where the addresses or domains of resources have changed. As a result, each title can be associated with multiple urls. This notebook doesn't harvest the full title details, it simply links title identifiers with subjects and collections. See Harvest the full collection of Pandora titles for more.

Titles can be linked to any level in this hierarchy. So to assemble a complete list of titles under a subject such as 'Arts', you need to get all the titles from 'Arts', all of the titles from all of the sub-categories under 'Arts', and all of the titles from all of the collections and sub-collections under both 'Arts' and its subcategories. See Create archived url datasets from Pandora's collections and subjects for an example of this.

For more on Pandora's approach to describing collections see Creating Structure in Web Archives With Collections: Different Concepts From Web Archivists.

Datasets¶

This notebook creates two datasets in ndjson format (one JSON object per line):

  • pandora-subjects.ndjson
  • pandora-collections.ndjson

The pandora-subjects.ndjson file includes the following fields:

  • name – subject heading
  • id – subject identifier in the form /subject/[number]
  • subcategories – list of subcategory identifiers
  • collections – list of collection identifiers
  • titles – list of title identifiers

The pandora-collections.ndjson file includes the following fields:

  • name – collection/subcollection name
  • id – collection identifier in the form /col/[number]
  • subcollections – list of subcollection identifiers
  • titles – list of title identifiers

Pre-harvested versions of these datasets are available from the Pandora collections data section of the GLAM Workbench.

In [ ]:
import json
import os
import re
import time
from pathlib import Path

import pandas as pd
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from tqdm.auto import tqdm

load_dotenv()
In [29]:
class SubjectHarvester:

    def __init__(
        self,
        subject_output="pandora-subjects.ndjson",
        collection_output="pandora-collections.ndjson",
        sample=None,
    ):
        self.subject_output = subject_output
        self.collection_output = collection_output
        self.sample = sample

    def get_title_ids(self, page_id):
        """
        Get the TEP identifiers for all the titles on the specified page.
        Excludes titles in subcollections as they will can be harvested separately.
        """
        title_ids = []
        page = 1
        # Subjects can have multiple pages of titles, so we'll go through page by page
        # until there's no more titles
        while page:
            response = requests.get(f"http://pandora.nla.gov.au{page_id}/{page}")
            soup = BeautifulSoup(response.text, "lxml")
            # we only want the first itemlist containing titles
            # subsequent titles will be part of subcollections
            title_links = []
            for item_list in soup.find_all("div", class_="itemlist"):
                # This checks if the title list has an h1 tag before it
                # which indicates its actually a subcollection
                if not (
                    item_list.find_previous_sibling("h1")
                    and item_list.find_previous_sibling("h1").name == "h1"
                ):
                    # Extract the TEP ids from the links
                    title_links = item_list.find_all("a", href=re.compile(r"/tep/\d+"))
                    for title_link in title_links:
                        title_ids.append(title_link["href"])
            # Continue if it's a subject page and there were title links on this page
            if title_links and "/col/" not in page_id:
                page += 1
            else:
                page = None
            time.sleep(0.5)
        return title_ids

    def harvest_subcategories(self, subject_id):
        """
        Harvest details of sub-categories from a subject page.
        """
        subject_ids = []
        # Get the subject page
        response = requests.get(f"http://pandora.nla.gov.au{subject_id}")
        soup = BeautifulSoup(response.text, "lxml")
        # Get all the links to subcategories
        subject_links = soup.find_all("a", href=re.compile(r"/subject/\d+$"))
        # Process all the sub-categories
        for subject_link in subject_links:
            subject_name = " ".join(subject_link.stripped_strings)
            subject_id = subject_link["href"]
            # Get collections
            collection_ids = self.harvest_collections(subject_id)
            # Get titles
            title_ids = self.get_title_ids(subject_id)
            with Path(self.subject_output).open("a") as subjects_file:
                subjects_file.write(
                    json.dumps(
                        {
                            "name": subject_name,
                            "id": subject_id,
                            "collections": collection_ids,
                            "titles": title_ids,
                        }
                    )
                    + "\n"
                )
            subject_ids.append(subject_id)
        return subject_ids

    def harvest_subcollections(self, coll_id, coll_name):
        """
        Harvest sub-collections from a collection page.
        """
        collection_ids = []
        # Get the collection page
        response = requests.get(f"http://pandora.nla.gov.au{coll_id}")
        soup = BeautifulSoup(response.text, "lxml")
        # Sub-collections are included in the collection pages and identified with h1 headings.
        # The h1 headings include a name attribute that is set to the sub-collection id.
        # You can use the id to request a page that just has the subcollection.
        # First get all the h1 tags
        for subc in soup.find_all("h1"):
            # Get the id value from the name attribute
            sub_link = subc.find("a", {"name": re.compile(r"\d+")})
            if sub_link:
                sub_name = sub_link.string
                # Add the collection name to the sub collection name (if it's not already there)
                if coll_name not in sub_name:
                    sub_name = f"{coll_name} - {sub_name}"
                # Use the sub-collection id to get a list of titles in the sub-collection
                sub_id = f"/col/{sub_link['name']}"
                title_ids = self.get_title_ids(sub_id)
                with Path(self.collection_output).open("a") as collections_file:
                    collections_file.write(
                        json.dumps(
                            {
                                "name": sub_name,
                                "id": sub_id,
                                "titles": title_ids,
                                "subcollections": [],
                            }
                        )
                        + "\n"
                    )
                collection_ids.append(sub_id)
        return collection_ids

    def harvest_collections(self, subject_id):
        """
        Harvest details of collections from a subject, or sub-category page.
        """
        collection_ids = []
        # Get the subject page
        response = requests.get(f"http://pandora.nla.gov.au{subject_id}")
        soup = BeautifulSoup(response.text, "lxml")
        # Get all of the links to collection pages
        collection_links = soup.find_all("a", href=re.compile(r"/col/\d+$"))
        # Process each collection page
        for coll_link in collection_links:
            coll_name = " ".join(coll_link.stripped_strings)
            coll_id = coll_link["href"]
            # Get any sub-collections
            subcollection_ids = self.harvest_subcollections(coll_id, coll_name)
            # Get titles
            title_ids = self.get_title_ids(coll_id)
            with Path(self.collection_output).open("a") as collections_file:
                collections_file.write(
                    json.dumps(
                        {
                            "name": coll_name,
                            "id": coll_id,
                            "subcollections": subcollection_ids,
                            "titles": title_ids,
                        }
                    )
                    + "\n"
                )
            collection_ids.append(coll_id)
        return collection_ids

    def harvest(self):
        """
        Start the harvest by getting the top-level subjects on the Pandora home page
        and work down the hierarchy from there.
        """
        # Remove old data files
        Path(self.subject_output).unlink(missing_ok=True)
        Path(self.collection_output).unlink(missing_ok=True)
        # Get the Pandora home page
        response = requests.get("http://pandora.nla.gov.au/")
        soup = BeautifulSoup(response.text, "lxml")
        # Find the list of subjects
        subject_list = soup.find("div", class_="browseSubjects").find_all("li")
        # Process each top-level subject
        for subject in tqdm(subject_list[: self.sample]):
            subject_link = subject.find("a")
            subject_name = " ".join(subject_link.stripped_strings)
            subject_id = subject_link["href"]
            # Get subcategories
            subcategory_ids = self.harvest_subcategories(subject_id)
            # Get collections
            subcollection_ids = self.harvest_collections(subject_id)
            # Get titles
            title_ids = self.get_title_ids(subject_id)
            with Path(self.subject_output).open("a") as subjects_file:
                subjects_file.write(
                    json.dumps(
                        {
                            "name": subject_name,
                            "id": subject_id,
                            "subcategories": subcategory_ids,
                            "collections": subcollection_ids,
                            "titles": title_ids,
                        }
                    )
                    + "\n"
                )
In [ ]:
harvester = SubjectHarvester()
harvester.harvest()

Remove duplicate collections¶

Collections can appear under multiple subjects, so there will be duplicates in the collections dataset.

In [ ]:
dfc = pd.read_json("pandora-collections.ndjson", lines=True)
In [ ]:
dfc.shape
In [ ]:
dfc.drop_duplicates(subset=["id"], inplace=True)
In [ ]:
dfc.shape
In [ ]:
dfc.to_json("pandora-collections.ndjson", orient="records", lines=True)
In [ ]:
# IGNORE CELL --TESTING ONLY
if os.getenv("GW_STATUS") == "dev":

    harvester = SubjectHarvester(
        subject_output="pandora-subjects-test.ndjson",
        collection_output="pandora-collections-test.ndjson",
        sample=1,
    )
    harvester.harvest()

    Path("pandora-subjects-test.ndjson").unlink(missing_ok=True)
    Path("pandora-collections-test.ndjson").unlink(missing_ok=True)

Created by Tim Sherratt for the GLAM Workbench.