Reshaping your newspaper harvest

The Trove Newspaper Harvester downloads the OCRd text of newspaper articles as individual text files – one file for each article. That's great for exploring the content of individual articles in depth, but sometimes you might want to zoom out and aggregate the files into larger chunks. For example, if you're interested in how language changes over time, you might want to create a separate corpus for each year in the results set. Or perhaps you want to examine differences in the way particular newspapers talk about an event by grouping the articles by newspaper. This notebook provides a slice and dice wonder tool for Trove newspaper harvests, enabling you to repackage OCRd text by decade, year, and newspaper title. It saves the results as zip files, concatenated text files, or CSV files with embedded text. These repackaged slices should suit a variety of text analysis tools and questions.
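To give a rough sense of what repackaging by year involves, here is a minimal Python sketch that concatenates article text files into one file per year. It is not the notebook's own code. It assumes each harvested file name begins with the article date as `YYYYMMDD` (for example `19460104-1002-206680758.txt`); check your own harvest before relying on that. The function names `group_by_year` and `concatenate_by_year` are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def group_by_year(text_dir):
    """Group article text files by the four-digit year that starts each file name.

    Assumption: file names begin with the article date as YYYYMMDD.
    Files that don't start with four digits are skipped.
    """
    groups = defaultdict(list)
    for path in sorted(Path(text_dir).glob("*.txt")):
        year = path.name[:4]
        if year.isdigit():
            groups[year].append(path)
    return dict(groups)


def concatenate_by_year(text_dir, output_dir):
    """Write one combined text file per year and return the output paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for year, paths in sorted(group_by_year(text_dir).items()):
        out_path = output_dir / f"{year}.txt"
        with out_path.open("w", encoding="utf-8") as out_file:
            for path in paths:
                # Separate articles with a blank line
                text = path.read_text(encoding="utf-8").strip()
                out_file.write(text + "\n\n")
        outputs.append(out_path)
    return outputs
```

Calling `concatenate_by_year("text", "by_year")` would then produce files such as `by_year/1946.txt`. Grouping by decade or newspaper would follow the same pattern with a different key taken from the file name or the harvest's metadata.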

Using this notebook

To run this notebook using the ARDC Binder service you'll need to log in using an account from an Australian university or research organisation. If you don't have an account, try MyBinder instead.

Run live on ARDC Binder

The MyBinder service doesn't require any authentication, but it can be slow to start and will sometimes fail when busy. If you have a login at an Australian university, you'll probably get better results with ARDC Binder.

Run live on MyBinder

Binder is great for experimentation and quick tasks, but for some projects you might need a dedicated, persistent environment in which to work. There's information on other options in the "Run these notebooks" section.

Cite as

Sherratt, Tim. (2024). GLAM-Workbench/trove-newspaper-harvester (version v2.1.1). Zenodo. https://doi.org/10.5281/zenodo.11295552