Extract text from PDF images using Tesseract

This notebook uses Tesseract (OCR) to extract text directly from the images in the Tasmanian Post Office Directory PDFs.

Although I was able to extract text from the PDFs directly, I wasn't happy with the quality. In particular, column layout detection was inconsistent, with values from different columns munged together. After a few tests, I decided that re-OCRing the images using Tesseract would produce better results. Tesseract's automatic page layout detection does a pretty good job of identifying the columns, and the OCR quality in general seems better. There's still some munging of values across columns, along with various other errors, but I think the quality is good enough for searching.
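As a rough illustration of the re-OCR step, here's a minimal sketch that calls the Tesseract command-line tool on page images and saves the recognised text. It isn't necessarily how this notebook does it. It assumes that Tesseract is installed and on your path, and that the page images have already been saved from the PDFs as PNG files. The function names, the folder layout, and the `eng` language setting are all hypothetical. Page segmentation mode 3 is Tesseract's fully automatic layout analysis, which is what handles the column detection.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, psm=3, lang="eng"):
    """Build a Tesseract CLI call that prints the recognised text to stdout.

    psm=3 asks for fully automatic page segmentation (Tesseract's default),
    leaving it to Tesseract to work out where the columns are.
    """
    return ["tesseract", str(image_path), "stdout", "--psm", str(psm), "-l", lang]


def ocr_image(image_path, psm=3, lang="eng"):
    """Run Tesseract on a single page image and return the text."""
    result = subprocess.run(
        build_tesseract_command(image_path, psm=psm, lang=lang),
        capture_output=True,
        text=True,
        check=True,  # raise an error if Tesseract fails
    )
    return result.stdout


def ocr_folder(image_dir):
    """OCR every PNG in a (hypothetical) folder, saving a .txt file beside each image."""
    for image_path in sorted(Path(image_dir).glob("*.png")):
        text = ocr_image(image_path)
        image_path.with_suffix(".txt").write_text(text, encoding="utf-8")
```

Calling `ocr_folder("images")` would then process each page in turn. If the columns on a particular volume are still being run together, trying a different `psm` value is one thing to experiment with.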

Run live on ARDC Binder

Other options

Run live on Binder (no authentication required)
Download from GitHub
View using NBViewer

Additional documentation

Run these notebooks

Getting help

Cite as

Sherratt, Tim. (2022). GLAM-Workbench/libraries-tasmania (version v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.7080837