Create archived url datasets from Pandora's collections and subjects¶
The Australian Web Archive makes billions of archived web pages searchable through Trove. But how would you go about constructing a search to find websites relating to Australian election campaigns? Fortunately you don't have to, as Pandora provides a collection of archived web resources organised by subject and collection – including thousands of sites about elections. This notebook makes it easy to save details of all the archived websites under any heading in Pandora's subject hierarchy, creating custom datasets relating to specific topics or events.
For convenience, this notebook uses pre-harvested datasets containing information about Pandora's subjects, collections and titles. New titles are added to Pandora frequently, so you might want to create your own updated versions using these notebooks:
Using this notebook¶
The simplest way to get started is to browse the subject and collection groupings in Pandora. Once you've found a subject or collection of interest, just copy its identifier, either /subject/[subject number] or /col/[collection number]. I've also created a single-page version of the complete subject hierarchy which makes it a bit easier to see what's included under each level.
Titles can be linked to any level in this hierarchy. To assemble a complete list of titles under a subject such as 'Arts', for example, you need to get all the titles from 'Arts' itself, all the titles from the sub-categories under 'Arts', and all the titles from the collections and sub-collections under both 'Arts' and its sub-categories. So when you create your dataset you need to decide whether you want every title under your selected heading, including those associated with its children, or only the titles directly linked to the heading itself.
You can then run either create_subject_dataset([your subject id]) or create_collection_dataset([your collection id]) in the cells below. If you want to include titles from any child categories or collections, set the include_subcategories and include_collections parameters to True.
For example, create_subject_dataset("/subject/6", include_collections=True) will generate a dataset containing every archived url under the 'Elections' heading, including urls in child collections.
Datasets¶
This notebook creates a CSV formatted dataset containing the following fields:
- tep_id – the Title Entry Page (TEP) identifier in the form /tep/[TEP NUMBER]
- name – the name of the title
- gathered_url – the url that was archived
- surt – the Sort-friendly URI Reordering Transform, a version of the url that reverses the order of the domain components to put the top-level domain first, making it easier to group or sort resources by domain
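If you're curious about what a surt actually looks like, here's a minimal sketch of the transformation. This is for illustration only; it's an assumption that it matches how the surts in the dataset were generated, and real SURT implementations handle many more edge cases.

from urllib.parse import urlparse

def to_surt(url):
    # Illustrative only: reverse the dotted domain components so the
    # top-level domain comes first, then append the path
    parsed = urlparse(url)
    host = ",".join(reversed(parsed.netloc.split(".")))
    return f"{host}){parsed.path}"

to_surt("http://www.example.com/about")
# 'com,example,www)/about'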
Note that Pandora's title records can bring together different urls and domains that have pointed to a resource over time. This means that there can be multiple urls associated with each TEP. See Harvest the full collection of Pandora titles for more information.
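You can see this by counting the distinct urls attached to each tep_id in the pre-harvested titles dataset. A minimal sketch, assuming the CSV contains the tep_id and gathered_url columns described above:

import pandas as pd

titles = pd.read_csv(
    "https://github.com/GLAM-Workbench/trove-web-archives-titles/raw/main/pandora-titles.csv"
)

# Count the number of distinct archived urls grouped under each TEP
url_counts = titles.groupby("tep_id")["gathered_url"].nunique()

# Show the TEPs that bring together more than one url
url_counts[url_counts > 1].sort_values(ascending=False).head()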
The dataset also includes an RO-Crate metadata file describing the dataset's contents and context.
What can you do with a collection of archived urls?¶
For more information about a Pandora title, use the tep_id to construct a url to a human-readable version in Trove, or a machine-readable JSON version:
- https://webarchive.nla.gov.au/tep/131444 – goes to TEP web page
- https://webarchive.nla.gov.au/bamboo-service/tep/131444 – returns JSON version of TEP
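For example, here's a minimal sketch that retrieves the JSON version of a TEP using requests. The fields inside the response are an assumption, so inspect the returned data to see what's actually available.

import requests

tep_id = "/tep/131444"

# The bamboo-service endpoint returns a machine-readable version of the TEP
response = requests.get(f"https://webarchive.nla.gov.au/bamboo-service{tep_id}")
response.raise_for_status()
tep = response.json()

# 'name' is an assumed field, check the response for the actual structure
tep.get("name")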
Once you have an archived url, you can make use of the tools in the Web Archives section of the GLAM Workbench to gather more data for analysis.
import os
from datetime import datetime
from pathlib import Path
import pandas as pd
from dotenv import load_dotenv
from IPython.display import HTML, display
from slugify import slugify
import crate_maker
load_dotenv()
# Load the pre-harvested dataset of Pandora collections
dfc = pd.read_json(
    "https://github.com/GLAM-Workbench/trove-web-archives-collections/raw/main/pandora-collections.ndjson",
    lines=True,
)

# Load the pre-harvested dataset of Pandora subjects
dfs = pd.read_json(
    "https://github.com/GLAM-Workbench/trove-web-archives-collections/raw/main/pandora-subjects.ndjson",
    lines=True,
)

# Load the pre-harvested dataset of Pandora titles
dft = pd.read_csv(
    "https://github.com/GLAM-Workbench/trove-web-archives-titles/raw/main/pandora-titles.csv"
)
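If you haven't already picked out an identifier by browsing Pandora, you can also search these dataframes directly. A quick sketch, using the id and name columns that the functions below rely on:

# Find subject headings whose names contain a keyword
dfs.loc[dfs["name"].str.contains("election", case=False), ["id", "name"]]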
Get title urls from a Pandora subject group¶
def get_title_ids_in_collection(coll_id, include_subcollections=True):
    """
    Get all the title ids in a collection.
    """
    title_ids = []
    # Get the collection record and add its directly-linked titles
    coll = dfc.loc[dfc["id"] == coll_id].iloc[0]
    title_ids += coll["titles"]
    if include_subcollections:
        # Add the titles from each of the collection's subcollections
        for scoll_id in coll["subcollections"]:
            scoll = dfc.loc[dfc["id"] == scoll_id].iloc[0]
            title_ids += scoll["titles"]
    return title_ids
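For example, to count the titles in the collection harvested later in this notebook:

# Get the ids of all titles in a collection, including its subcollections
title_ids = get_title_ids_in_collection("/col/21326")
len(title_ids)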
def get_titles_by_subject(
    subject, include_subcategories=False, include_collections=False
):
    title_ids = []
    # Titles linked directly to the subject
    title_ids += subject["titles"]
    if include_subcategories:
        # Titles linked to each subcategory of the subject
        for subc_id in subject["subcategories"]:
            subc = dfs.loc[dfs["id"] == subc_id].iloc[0]
            title_ids += subc["titles"]
            if include_collections:
                # Titles in collections attached to the subcategory
                for coll_id in subc["collections"]:
                    title_ids += get_title_ids_in_collection(coll_id)
    if include_collections:
        # Titles in collections attached directly to the subject
        for coll_id in subject["collections"]:
            title_ids += get_title_ids_in_collection(coll_id)
    # Use the title ids to select rows from the titles dataset
    titles = dft.loc[dft["tep_id"].isin(title_ids)]
    return titles
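For example, to preview the titles gathered for the subject used in the example below:

# Look up a subject record by its id, then gather all of its titles
subject = dfs.loc[dfs["id"] == "/subject/3"].iloc[0]
get_titles_by_subject(
    subject, include_subcategories=True, include_collections=True
).head()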
def create_subject_dataset(
    id, include_subcategories=False, include_collections=False, include_crate=True
):
    # Record the harvest start and end times for the RO-Crate metadata
    start_date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # Get the subject record and gather its titles
    subject = dfs.loc[dfs["id"] == id].iloc[0]
    df = get_titles_by_subject(
        subject,
        include_subcategories=include_subcategories,
        include_collections=include_collections,
    )
    end_date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # Save the dataset as a CSV file in its own directory
    subject_slug = slugify(f"pandora-{id}-{subject['name']}")
    output_path = Path("datasets", subject_slug)
    output_path.mkdir(exist_ok=True, parents=True)
    output_file = Path(output_path, f"pandora-{subject_slug}.csv")
    df.to_csv(output_file, index=False)
    if include_crate:
        # Add an RO-Crate metadata file describing the dataset
        crate_maker.create_rocrate(subject, output_file, start_date, end_date)
    display(
        HTML(
            f"Download dataset: <a href='datasets/{subject_slug}.zip' download>datasets/{subject_slug}.zip</a>"
        )
    )
create_subject_dataset(
"/subject/3", include_subcategories=True, include_collections=True
)
Get title urls from a Pandora collection¶
def get_titles_by_collection(coll, include_subcollections=True):
    title_ids = get_title_ids_in_collection(
        coll["id"], include_subcollections=include_subcollections
    )
    # Use the title ids to select rows from the titles dataset
    titles = dft.loc[dft["tep_id"].isin(title_ids)]
    return titles
def create_collection_dataset(id, include_subcollections=False, include_crate=True):
    # Record the harvest start and end times for the RO-Crate metadata
    start_date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # Get the collection record and gather its titles
    coll = dfc.loc[dfc["id"] == id].iloc[0]
    df = get_titles_by_collection(
        coll,
        include_subcollections=include_subcollections,
    )
    end_date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # Save the dataset as a CSV file in its own directory
    coll_slug = slugify(f"pandora-{id}-{coll['name']}")
    output_path = Path("datasets", coll_slug)
    output_path.mkdir(exist_ok=True, parents=True)
    output_file = Path(output_path, f"pandora-{coll_slug}.csv")
    df.to_csv(output_file, index=False)
    if include_crate:
        # Add an RO-Crate metadata file describing the dataset
        crate_maker.create_rocrate(coll, output_file, start_date, end_date)
    display(
        HTML(
            f"Download dataset: <a href='datasets/{coll_slug}.zip' download>datasets/{coll_slug}.zip</a>"
        )
    )
create_collection_dataset("/col/21326", include_subcollections=True)
# IGNORE CELL -- TESTING ONLY
if os.getenv("GW_STATUS") == "dev":
create_subject_dataset(
"/subject/3",
include_subcategories=True,
include_collections=True,
include_crate=False,
)
create_collection_dataset(
"/col/21326", include_subcollections=True, include_crate=False
)
Created by Tim Sherratt for the GLAM Workbench.