Extract text from PDF images using Tesseract
This notebook uses the Tesseract OCR engine to extract text directly from the page images in the Tasmanian Post Office Directory PDFs.
Although I was able to extract the existing text layer from the PDFs directly, I wasn't happy with its quality. In particular, column layout detection was quite variable, so values from different columns were often munged together. After a few tests, I decided that re-OCRing the images using Tesseract would produce better results. Tesseract's automatic page layout detection does a pretty good job of identifying the columns, and the OCR quality in general seems better. There's still some munging of values across columns, along with various other errors, but I think the quality is good enough for searching.
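As a rough illustration of the re-OCR step, here is a minimal sketch that calls the `tesseract` command-line tool on a folder of page images. It assumes the images have already been extracted from the PDFs and saved as files, and that `tesseract` is installed and on the `PATH`. The function names, the `*.jpg` pattern, and the choice of `--psm 3` (Tesseract's fully automatic page segmentation mode) are assumptions made for this example, not necessarily the settings used in this notebook.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, lang="eng", psm=3):
    """Build the command line for OCRing one page image with Tesseract.

    Passing "stdout" as the output base makes Tesseract print the text
    rather than write a file. --psm 3 requests fully automatic page
    segmentation, the mode in which Tesseract tries to detect columns.
    (Hypothetical helper; the settings are assumptions.)
    """
    return ["tesseract", str(image_path), "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image_path, lang="eng", psm=3):
    """Run Tesseract on a single image and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("The tesseract executable was not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout


def ocr_directory(image_dir, output_dir, pattern="*.jpg"):
    """OCR every matching image and save one text file per page."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for image_path in sorted(Path(image_dir).glob(pattern)):
        text = ocr_image(image_path)
        (output_dir / f"{image_path.stem}.txt").write_text(text, encoding="utf-8")
```

Calling `ocr_directory("images", "ocr")` (both directory names are placeholders) would write one `.txt` file per page image. Saving the text page by page keeps each OCR result linked to the image it came from.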
Other options
- Run live on Binder (no authentication required)
- Download from GitHub
- View using NBViewer
Cite as
Sherratt, Tim. (2022). GLAM-Workbench/libraries-tasmania (version v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.7080837