OCRd text from Trove digitised journals
This dataset contains OCRd text and metadata harvested from digitised periodicals in Trove.
The zip file contains a directory for each periodical, named using its title and identifier, e.g. 14th-company-magazine-nla.obj-15956697. Each directory contains a CSV-formatted list of issues (issues.csv) and a subdirectory named texts that contains a text file of OCRd text for each issue. The text files are named using the issue's date and identifier, e.g. 1918-06-14-nla.obj-15967449.txt. If text was successfully downloaded from an issue, the issues.csv file will include the name of the text file in the text_file column.
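The layout described above can be walked with a short script. The following is a minimal sketch, not part of the dataset's own tooling. It assumes the zip has already been extracted, that the periodical directories sit directly under the directory you pass in, that the files are UTF-8 encoded, and that the text_file column holds a bare file name rather than a path. The function name iter_issue_texts is hypothetical.

```python
import csv
from pathlib import Path


def iter_issue_texts(root):
    """Yield (periodical_dir_name, issue_row, text) for each issue with OCRd text.

    Assumes `root` is the extracted zip and that each periodical directory
    directly under it contains `issues.csv` and a `texts` subdirectory.
    """
    root = Path(root)
    for issues_csv in sorted(root.glob("*/issues.csv")):
        periodical_dir = issues_csv.parent
        with issues_csv.open(newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                # An empty text_file value means no text was downloaded for this issue.
                name = (row.get("text_file") or "").strip()
                if not name:
                    continue
                text_path = periodical_dir / "texts" / name
                # Skip rows whose text file is missing on disk.
                if text_path.exists():
                    yield periodical_dir.name, row, text_path.read_text(encoding="utf-8")
```

For example, `for periodical, row, text in iter_issue_texts("trove-periodicals"): ...` would loop over every issue that has text, where "trove-periodicals" stands in for wherever you extracted the zip. If the archive unpacks into an extra top-level folder, pass that folder instead.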
Files
trove-periodicals.zip
| date harvested | 2024-03-12 |
| format | application/zip |
| file size | 3.7 GB |
Context of creation
| date harvested | 2024-03-12 |
Getting help
Cite as
Sherratt, Tim. (2024). GLAM-Workbench/trove-journals (version v2.2.0). Zenodo. https://doi.org/10.5281/zenodo.13744407