Harvest public tags from Trove zones¶

This notebook harvests all the public tags that users have added to records in Trove. However, tags are being added all the time, so by the time you've finished harvesting, the dataset will probably be out of date.

You can access tags via the API by adding has:tags to the q (query) parameter to limit results to records with tags, and setting include=tags to embed the tag data in each item record.
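
If you want to see what the tag data looks like before running a full harvest, you can try a single request. The cell below is a minimal sketch using the newspaper zone and the same v2 endpoint and parameters as the harvesting code later in this notebook; you'll need to insert your own Trove API key.

In [ ]:
import requests

# A single request to the Trove v2 API – limit results to records with
# tags, and include the tag data in each record
params = {
    "q": "has:tags",
    "zone": "newspaper",
    "include": "tags",
    "encoding": "json",
    "n": 1,
    "key": "YOUR API KEY",  # insert your own Trove API key
}
response = requests.get("http://api.trove.nla.gov.au/v2/result", params=params)
data = response.json()

# Each record has a 'tag' list; each tag has 'value' and 'lastupdated' fields
record = data["response"]["zone"][0]["records"]["article"][0]
print(record["tag"])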

The harvest_tags() function harvests all tags from the specified zone and writes them to a CSV file named according to the zone, for example, tags_newspaper.csv.

Each CSV file contains the following columns:

  • tag – the tag text
  • date – date the tag was added
  • zone – the Trove API zone (eg 'newspaper', 'book')
  • record_id – the id of the record to which the tag has been added

Once the zone harvests are complete you can use this notebook to combine the separate CSV files, normalise the capitalisation of tags, and save the complete results into a single CSV file.

Some things to note:

  • Works (like books) can have tags attached at either work or version level. To simplify things, this code aggregates all tags at the work level, removing any duplicates.
  • A single resource in Trove can appear in multiple zones – for example, a book that includes maps and illustrations might appear in the 'book', 'picture', and 'map' zones. This means that some of the tags will essentially be duplicates – harvested from different zones, but relating to the same resource. The sketch after this list shows one rough way of estimating how much of this duplication there is.
  • User content added to Trove, including tags, is available for reuse under a CC-BY-NC licence.
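
The sketch below gives a rough way of estimating the amount of cross-zone duplication. It assumes the combined, normalised CSV file created at the end of this notebook, so you'd run it after the harvest has finished; the filename and column names match those used below.

In [ ]:
import pandas as pd

# Assumes the combined, normalised dataset saved at the end of this notebook
df = pd.read_csv("trove_tags_20240606.csv")

# Works share the same id across zones, so ignoring the zone column and
# de-duplicating gives a rough count of cross-zone duplicates
deduped = df.drop_duplicates(subset=["tag", "date", "record_id"])
print(f"{len(df) - len(deduped):,} rows look like cross-zone duplicates")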

The complete dataset created by this notebook is available for download from Zenodo.

For some examples of how you might analyse and visualise the harvested tags, see this notebook.

This notebook has not been updated to work with version 3 of the Trove API because, as of June 2024, there remains a problem in the data that causes bulk harvests using v3 to fail.

In [1]:
import csv
import os
import time
from pathlib import Path

import pandas as pd
import requests_cache
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from tqdm.auto import tqdm

# Create a cached session and retry requests that fail with server errors
s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
s.mount("http://", HTTPAdapter(max_retries=retries))
s.mount("https://", HTTPAdapter(max_retries=retries))
In [2]:
%%capture
# Load variables from the .env file if it exists
# Use %%capture to suppress messages
%load_ext dotenv
%dotenv
In [3]:
# Insert your Trove API key between the quotes
API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")
In [4]:
api_url = "http://api.trove.nla.gov.au/v2/result"

# Set basic parameters
params = {
    "q": "has:tags",
    "include": "tags",
    "encoding": "json",
    "bulkHarvest": "true",
    "n": 100,
    "key": API_KEY,
}

# These types are needed to get data from API results
record_types = {
    "newspaper": "article",
    "gazette": "article",
    "book": "work",
    "article": "work",
    "picture": "work",
    "music": "work",
    "map": "work",
    "collection": "work",
    "list": "list",
}
In [5]:
def get_total(cparams):
    """
    This will enable us to make a nice progress bar...
    """
    response = s.get(api_url, params=cparams)
    data = response.json()
    return int(data["response"]["zone"][0]["records"]["total"])


def get_tags_from_record(record):
    """
    Extract tags from the supplied record.
    Returns a list of tags.
    Each tag is a list with two elements – value and date.
    """
    tags = []
    try:
        for tag in record["tag"]:
            tag_data = [tag.get("value"), tag.get("lastupdated")]
            tags.append(tag_data)
    except KeyError:
        pass
    return tags


def harvest_tags(zone):
    """
    Harvest public tags from the specified zone.
    Results are written to a CSV file.
    """
    print(zone)
    # article, work, or list
    record_type = record_types[zone]
    # Delete existing data file
    Path(f"tags_{zone}.csv").unlink(missing_ok=True)
    # Write column headings
    with Path(f"tags_{zone}.csv").open("a") as tag_file:
        writer = csv.writer(tag_file)
        writer.writerow(["tag", "date", "zone", "record_id"])
    start = "*"
    cparams = params.copy()
    cparams["zone"] = zone
    # If it's a work, get versions as well
    if record_type == "work":
        cparams["include"] = "tags,workversions"
    total = get_total(cparams)
    with tqdm(total=total) as pbar:
        while start is not None:
            cparams["s"] = start
            response = s.get(api_url, params=cparams)
            data = response.json()
            results = data["response"]["zone"][0]["records"]
            # Get token for next page
            try:
                start = results["nextStart"]
            # End of the result set
            except KeyError:
                start = None
            with Path(f"tags_{zone}.csv").open("a") as tag_file:
                writer = csv.writer(tag_file)
                for record in results[record_type]:
                    tags = []
                    tags += get_tags_from_record(record)
                    # If there are versions loop through them gathering tags
                    if "version" in record:
                        for version in record["version"]:
                            tags += get_tags_from_record(version)
                    # Remove duplicate tags on work
                    tags = [list(t) for t in {tuple(tl) for tl in tags}]
                    # If a record somehow has no tags, print it so it can be checked
                    if len(tags) == 0:
                        print(record)
                    # Add zone and record_id, then write to CSV
                    for tag in tags:
                        tag.append(zone)
                        tag.append(record["id"])
                        writer.writerow(tag)
            pbar.update(len(results[record_type]))
            if not response.from_cache:
                time.sleep(0.2)
In [ ]:
for zone in [
    "newspaper",
    "gazette",
    "book",
    "article",
    "picture",
    "music",
    "map",
    "collection",
    "list",
]:
    harvest_tags(zone)

Combine the tag files and convert to a dataframe¶

In [9]:
dfs = []
for zone in [
    "newspaper",
    "gazette",
    "book",
    "article",
    "picture",
    "music",
    "map",
    "collection",
    "list",
]:
    dfs.append(pd.read_csv(f"tags_{zone}.csv"))
df = pd.concat(dfs)
df.head()
Out[9]:
  tag                         date                  zone       record_id
0 TCCC                        2024-03-26T23:22:30Z  newspaper    1000000
1 TCCC                        2024-03-26T23:32:50Z  newspaper  100000001
2 Stephen Guihen              2013-03-24T02:30:11Z  newspaper  100000011
3 test 22/6/23 @ 9:09am       2023-06-21T23:09:34Z  newspaper  100000068
4 HICKEN Aberaham - Barellan  2019-12-03T23:02:10Z  newspaper  100000071

How many tags have been added in total? (Note that this count can include duplicates where the same resource appears in multiple zones.)

In [10]:
df.shape
Out[10]:
(10403650, 4)

How many unique tags are there?

In [11]:
df["tag"].nunique()
Out[11]:
2495958

Normalise capitalisation and save as CSV¶

Tag capitalisation is inconsistent, even though tag searches in Trove are case-insensitive. Here we'll convert all the tags to lower case so that we can aggregate the different forms.

In [12]:
df["tag_normalised"] = df["tag"].str.lower()

To keep things compact, we'll drop the original mixed-case tag column and rename the normalised column.

In [13]:
# Remove the unnormalised tag column
df.drop(columns="tag", inplace=True)
# Rename the lowercase tag column
df.rename(columns={"tag_normalised": "tag"}, inplace=True)

Now let's save the complete, normalised dataset to a single CSV file.

In [14]:
# Reorder columns and save as CSV
df[["tag", "date", "zone", "record_id"]].to_csv("trove_tags_20240606.csv", index=False)

Created by Tim Sherratt for the GLAM Workbench. Support this project by becoming a GitHub sponsor.