Convert an HTML finding aid to JSON

While I think the finding aids are created and stored as EAD-encoded XML files, they are delivered as HTML. This means that to reassemble the finding aid hierarchy in a way that facilitates analysis, we have to scrape the HTML and make a few assumptions about the content.

This notebook scrapes data from the HTML of a finding aid, saving the hierarchy of series, sub-series, and items as a list of nested objects. The results can be saved as a JSON file.
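As a rough illustration of the general approach, here is a minimal sketch that rebuilds a nested hierarchy from HTML and serialises it as JSON. It is not the notebook's actual code. It assumes, purely for illustration, that series and sub-series appear as `h2`/`h3`/`h4` headings and that items appear as `p` elements; the real finding aid markup is likely to differ. The names `SAMPLE_HTML`, `FindingAidParser`, and `parse_finding_aid` are hypothetical, and only the Python standard library is used.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample markup; a real finding aid will be structured differently.
SAMPLE_HTML = """
<h2>Series 1. Correspondence</h2>
<h3>Subseries 1.1. Letters</h3>
<p class="item">Item 1. Letter to a friend</p>
<p class="item">Item 2. Letter to a publisher</p>
<h2>Series 2. Diaries</h2>
<p class="item">Item 3. Diary, 1901</p>
"""


class FindingAidParser(HTMLParser):
    """Build nested objects from headings (assumed series) and paragraphs (assumed items)."""

    # Assumption: heading depth mirrors the series / sub-series hierarchy.
    HEADINGS = {"h2": 1, "h3": 2, "h4": 3}

    def __init__(self):
        super().__init__()
        self.root = {"title": "root", "level": 0, "children": [], "items": []}
        self.stack = [self.root]  # path from the root to the current node
        self.current_tag = None   # tag whose text is being collected
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS or tag == "p":
            self.current_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current_tag:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.current_tag:
            return
        # Collapse runs of whitespace in the collected text.
        text = " ".join("".join(self.buffer).split())
        self.current_tag = None
        if not text:
            return
        if tag == "p":
            # Items attach to the most recently opened series or sub-series.
            self.stack[-1]["items"].append(text)
            return
        level = self.HEADINGS[tag]
        # Climb back up until the top of the stack is a valid parent.
        while self.stack[-1]["level"] >= level:
            self.stack.pop()
        node = {"title": text, "level": level, "children": [], "items": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)


def parse_finding_aid(html):
    """Return the top-level series as a list of nested dictionaries."""
    parser = FindingAidParser()
    parser.feed(html)
    parser.close()
    return parser.root["children"]


if __name__ == "__main__":
    hierarchy = parse_finding_aid(SAMPLE_HTML)
    print(json.dumps(hierarchy, indent=2))
```

The stack keeps track of the current position in the hierarchy, so each new heading is attached to the nearest preceding heading of a shallower level. Writing the result to a file is then a matter of passing the list to `json.dump` with an open file handle.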

Cite as

Sherratt, Tim. (2023). GLAM-Workbench/trove-unpublished (version v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.7690276