
Extract text from PDF images using Tesseract

This notebook uses Tesseract, an open-source OCR engine, to extract text directly from the page images in the Tasmanian Post Office Directory PDFs.

Although I was able to extract text from the PDFs directly, I wasn't happy with the quality. In particular, column layout detection was inconsistent, and values from different columns were often munged together. After a few tests, I decided that re-OCRing the images using Tesseract would produce better results. Tesseract's automatic page layout detection does a pretty good job of identifying the columns, and the OCR quality in general seems better. Some values are still munged across columns, and there are various other errors, but I think the quality is good enough for searching.
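As a rough illustration of the re-OCR step, here is a minimal sketch that calls the Tesseract command-line tool on page images. It is not the notebook's actual code: the function names, the `*.png` file pattern, and the assumption that the page images have already been saved from the PDFs into a directory are all hypothetical. The `--psm 3` option selects Tesseract's fully automatic page segmentation, which is the layout detection mentioned above.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, lang="eng", psm=3):
    """Build a Tesseract CLI call that writes the recognised text to stdout.

    psm=3 is Tesseract's fully automatic page segmentation (its default),
    which tries to detect columns and other layout blocks on its own.
    """
    return ["tesseract", str(image_path), "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image_path, lang="eng", psm=3):
    """Run Tesseract on a single page image and return the extracted text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("The tesseract executable was not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise an error if Tesseract exits with a failure code
    )
    return result.stdout


def ocr_directory(image_dir, pattern="*.png"):
    """OCR every image in a directory (hypothetical layout: one image per page)."""
    return {
        path.name: ocr_image(path)
        for path in sorted(Path(image_dir).glob(pattern))
    }
```

Calling `ocr_directory("images")` on a hypothetical `images` folder would return a dictionary mapping each image's file name to its OCRd text, ready to be saved or indexed for searching.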


Cite as

Sherratt, Tim. (2022). GLAM-Workbench/libraries-tasmania (version v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.7080837