Harvest summary data from Trove lists¶

Using the Trove API we'll harvest some information about Trove lists and create a dataset containing the following fields:

  • id — the list identifier; you can use it to get more information about a list from either the web interface or the API
  • title
  • number_items — the number of items in the list
  • created — the date the list was created
  • updated — the date the list was last updated
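
Each harvested list ends up as a simple dictionary with those five fields, which pandas can turn straight into a table. The values below are invented for illustration:

```python
import pandas as pd

# Illustrative only -- these values are made up, not real Trove data
sample_record = {
    "id": "12345",
    "title": "An example list",
    "number_items": 42,
    "created": "2015-04-03T11:50:51Z",
    "updated": "2016-02-22T04:27:12Z",
}

df_sample = pd.DataFrame([sample_record])
print(df_sample.columns.tolist())
```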

If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!

Some tips:

  • Code cells have boxes around them.
  • To run a code cell click on the cell and then hit Shift+Enter. The Shift+Enter combo will also move you to the next cell, so it's a quick way to work through the notebook.
  • While a cell is running a * appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.
  • In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.
  • To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.

Setting up...¶

In [ ]:
import datetime
import os
import warnings
from json import JSONDecodeError
from operator import itemgetter

warnings.simplefilter(action="ignore", category=FutureWarning)

import altair as alt
import nltk
import pandas as pd
import requests_cache
from dotenv import load_dotenv
from IPython.display import HTML, display
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from textblob import TextBlob
from tqdm.auto import tqdm
from wordcloud import WordCloud

nltk.download("stopwords")
nltk.download("punkt")

s = requests_cache.CachedSession()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
s.mount("http://", HTTPAdapter(max_retries=retries))
s.mount("https://", HTTPAdapter(max_retries=retries))

load_dotenv()

Add your Trove API key¶

In [19]:
# Insert your Trove API key between the quotes
API_KEY = "YOUR API KEY"

# Use api key value from environment variables if it is available
if os.getenv("TROVE_API_KEY"):
    API_KEY = os.getenv("TROVE_API_KEY")

Set some parameters¶

You could add a q value to the parameters below if you only want to harvest a subset of lists. As written, the cell below harvests everything.

In [20]:
api_url = "https://api.trove.nla.gov.au/v3/result"

params = {
    "category": "list",
    "encoding": "json",
    "n": 100,
    "s": "*",
    "reclevel": "full",
    "bulkHarvest": "true",
}

headers = {"X-API-KEY": API_KEY}
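
If you only want a subset of lists, the parameter set might look like this (the keyword is just an example, not part of the original harvest):

```python
# Same parameters as above, with an illustrative "q" keyword added
# to restrict the harvest to matching lists.
subset_params = {
    "category": "list",
    "encoding": "json",
    "n": 100,
    "s": "*",
    "reclevel": "full",
    "bulkHarvest": "true",
    "q": "weather",  # example keyword only
}
print(subset_params["q"])
```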

Harvest the data¶

In [21]:
def get_total():
    """
    Get the total number of results -- this will enable us to make a nice progress bar.
    """
    response = s.get(api_url, params=params, headers=headers)
    data = response.json()
    return int(data["category"][0]["records"]["total"])
In [22]:
lists = []
total = get_total()
with tqdm(total=total) as pbar:
    while params["s"]:
        response = s.get(api_url, params=params, headers=headers)
        try:
            data = response.json()
        except JSONDecodeError:
            print(response.text)
            print(response.url)
            raise
        else:
            records = data["category"][0]["records"]
            try:
                params["s"] = records["nextStart"]
            except KeyError:
                params["s"] = None
            for record in records["list"]:
                try:
                    lists.append(
                        {
                            "id": record["id"],
                            "title": record.get("title", ""),
                            "number_items": record["listItemCount"],
                            "created": record["date"]["created"],
                            "updated": record["date"]["lastupdated"],
                        }
                    )
                except TypeError:
                    print(record)
            pbar.update(100)
  0%|          | 0/111965 [00:00<?, ?it/s]
None
None
None
None
None
(A handful of malformed records print as None while the harvest runs.)
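
The harvest loop above relies on Trove's cursor-style paging: each response includes a nextStart value that becomes the next s parameter, until no nextStart is returned. The same pattern can be sketched against a fake paged API (the page contents here are invented):

```python
def harvest(fetch_page, start="*"):
    """Follow a nextStart-style cursor until the API stops returning one.

    fetch_page is any callable that takes a cursor value and returns
    (records, next_cursor_or_None) -- here it stands in for a Trove request.
    """
    cursor = start
    results = []
    while cursor is not None:
        records, cursor = fetch_page(cursor)
        results.extend(records)
    return results


# A fake two-page API for illustration
pages = {"*": (["a", "b"], "page2"), "page2": (["c"], None)}
records = harvest(lambda s: pages[s])
print(records)  # -> ['a', 'b', 'c']
```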

Inspect the results¶

In [5]:
# Load past file for testing if in dev
if os.getenv("GW_STATUS") == "dev":
    df = pd.read_csv("data/trove-lists-2024-05-29.csv")
# Otherwise load current harvested data
else:
    df = pd.DataFrame(lists)
df.head()
In [6]:
df.describe()
Out[6]:
id number_items
count 111960.000000 111960.000000
mean 89844.496008 19.163433
std 50142.898174 83.319781
min 51.000000 0.000000
25% 47114.500000 1.000000
50% 90193.500000 4.000000
75% 132493.750000 13.000000
max 179448.000000 10351.000000

Save the harvested data as a CSV file¶

In [ ]:
csv_file = "data/trove-lists-{}.csv".format(datetime.datetime.now().isoformat()[:10])
df.to_csv(csv_file, index=False)
HTML('<a target="_blank" href="{}">Download CSV</a>'.format(csv_file))
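
To pick up the analysis later without re-harvesting, you can reload a saved CSV and parse the date columns as you go. A StringIO buffer with one invented row stands in for the saved file here:

```python
import io

import pandas as pd

# Stand-in for a previously saved harvest file (one invented row)
csv_data = io.StringIO(
    "id,title,number_items,created,updated\n"
    "51,Example list,3,2010-01-01T00:00:00Z,2010-06-01T00:00:00Z\n"
)
df_saved = pd.read_csv(csv_data, parse_dates=["created", "updated"])
print(df_saved.dtypes)
```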

How many items are in lists?¶

In [7]:
total_items = df["number_items"].sum()
print("There are {:,} items in {:,} lists.".format(total_items, df.shape[0]))
There are 2,145,538 items in 111,960 lists.
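
The summary stats above show that list sizes are highly skewed (a median of 4 items against a maximum of 10,351). Counting the empty lists is a one-liner; the DataFrame below is a toy stand-in for the harvested data:

```python
import pandas as pd

# Toy stand-in for the harvested data -- counts are invented
df_toy = pd.DataFrame({"number_items": [0, 0, 4, 13, 10351]})

empty = int((df_toy["number_items"] == 0).sum())
print(f"{empty:,} of {len(df_toy):,} lists are empty.")
```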

What is the biggest list?¶

In [8]:
biggest = df.loc[df["number_items"].idxmax()]
biggest
Out[8]:
id                                  71461
title           Victoria and elsewhere...
number_items                        10351
created              2015-04-03T11:50:51Z
updated              2016-02-22T04:27:12Z
Name: 91223, dtype: object
In [9]:
display(
    HTML(
        'The biggest list is <a target="_blank" href="https://trove.nla.gov.au/list?id={}">{}</a> with {:,} items.'.format(
            biggest["id"], biggest["title"], biggest["number_items"]
        )
    )
)
The biggest list is Victoria and elsewhere... with 10,351 items.

When were they created?¶

In [10]:
# This makes it possible to include more than 5000 records
# alt.data_transformers.enable('json', urlpath='files')
alt.data_transformers.disable_max_rows()
alt.Chart(df[["created"]]).mark_line().encode(
    x="yearmonth(created):T",
    y="count()",
    tooltip=[
        alt.Tooltip("yearmonth(created):T", title="Month"),
        alt.Tooltip("count()", title="Lists"),
    ],
).properties(width=600)
Out[10]:
[Chart: number of lists created per month]

What words are used in the titles?¶

In [11]:
titles = df["title"].str.lower().str.cat(sep=" ")
In [12]:
# Generate a word cloud image
wordcloud = WordCloud(width=1200, height=800).generate(titles)
wordcloud.to_image()
Out[12]:
[Word cloud of the most common words in list titles]

Word frequency¶

In [13]:
blob = TextBlob(titles)
stopwords = nltk.corpus.stopwords.words("english")
word_counts = [
    [word, count]
    for word, count in blob.lower().word_counts.items()
    if word not in stopwords
]
word_counts = sorted(word_counts, key=itemgetter(1), reverse=True)[:25]
pd.DataFrame(word_counts).style.format({1: "{:,}"}).bar(
    subset=[1], color="#d65f5f"
).set_properties(subset=[1], **{"width": "300px"})
Out[13]:
  0 1
0 family 7,377
1 list 4,358
2 ww1 4,333
3 soldier 4,303
4 articles 4,214
5 trove 3,962
6 john 2,918
7 william 2,723
8 history 2,419
9 james 1,962
10 george 1,639
11 thomas 1,586
12 henry 1,397
13 australian 1,198
14 australia 1,161
15 charles 1,127
16 mary 1,038
17 nsw 894
18 edward 892
19 nee 867
20 ww2 840
21 robert 834
22 joseph 780
23 nt 765
24 arthur 764
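
The same counting can be done with the standard library alone, which is a handy cross-check on TextBlob's numbers. The titles and stopword list below are toy stand-ins:

```python
from collections import Counter

# Toy titles and stopwords standing in for the real data
titles_toy = "family history of the smith family"
stopwords_toy = {"of", "the"}

counts = Counter(w for w in titles_toy.split() if w not in stopwords_toy)
print(counts.most_common(2))
```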

Bigram frequency¶

In [14]:
ngrams = [" ".join(ngram).lower() for ngram in blob.lower().ngrams(2)]
ngram_counts = (
    pd.DataFrame(ngrams)[0]
    .value_counts()
    .rename_axis("ngram")
    .reset_index(name="count")
)
display(
    ngram_counts[:25]
    .style.format({"count": "{:,}"})
    .bar(subset=["count"], color="#d65f5f")
    .set_properties(subset=["count"], **{"width": "300px"})
)
  ngram count
0 ww1 soldier 3,958
1 list of 3,886
2 of articles 3,858
3 soldier list 3,847
4 in trove 3,756
5 articles in 3,737
6 family history 1,103
7 nt ww2 725
8 family tree 367
9 of the 357
10 in australia 319
11 in the 318
12 wwi soldier 271
13 family of 255
14 south australia 232
15 william ww1 221
16 port lincoln 209
17 henry ww1 194
18 john ww1 182
19 and the 177
20 maroochydore slsc 175
21 world war 171
22 james ww1 161
23 mary ann 160
24 motor boat 153
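
TextBlob's ngrams method is doing something you can also sketch with zip: pair each token with its successor. A toy token list illustrates the idea:

```python
# Pair each word with the next one to form bigrams (toy tokens)
words = "ww1 soldier list of articles".split()
bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
print(bigrams)
```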

Trigram frequency¶

In [15]:
ngrams = [" ".join(ngram).lower() for ngram in blob.lower().ngrams(3)]
ngram_counts = (
    pd.DataFrame(ngrams)[0]
    .value_counts()
    .rename_axis("ngram")
    .reset_index(name="count")
)
display(
    ngram_counts[:25]
    .style.format({"count": "{:,}"})
    .bar(subset=["count"], color="#d65f5f")
    .set_properties(subset=["count"], **{"width": "300px"})
)
  ngram count
0 list of articles 3,847
1 soldier list of 3,840
2 articles in trove 3,728
3 of articles in 3,721
4 ww1 soldier list 3,563
5 wwi soldier list 266
6 william ww1 soldier 219
7 henry ww1 soldier 191
8 john ww1 soldier 180
9 james ww1 soldier 160
10 george ww1 soldier 150
11 charles ww1 soldier 133
12 joseph ww1 soldier 124
13 edward ww1 soldier 123
14 of articles on 118
15 articles on trove 117
16 thomas ww1 soldier 115
17 australian gymnastics research 109
18 andrews of albury 106
19 cocker spaniel affix 105
20 arthur andrews of 100
21 dr arthur andrews 100
22 ww1 trophy guns 92
23 music resources theme 79
24 robert ww1 soldier 73

Created by Tim Sherratt for the GLAM Workbench.