Updates:
- Anytime - Get the latest updates on PyPDFOCR
- 10/28/13 - Adds uploading to Evernote notebooks based on keywords!
- 10/25/13 - Supports filing to directories based on keyword search
- 10/22/13 - Now on PyPI, so you can just run "pip install pypdfocr"! (On Windows, I still recommend downloading my prebuilt .exe, as described below.)
- 10/21/13 - The script can watch a directory for new PDFs and automatically run OCR on them!
In a previous post, I discussed how I've become a fan of Fujitsu's ScanSnap scanner as a way to reduce the pile of paper in my office. I've been wanting to script more of the flow, and the one stumbling block has been the optical character recognition (OCR) phase that makes the scanned PDF searchable. In this post, I'll detail my experience using Tesseract, a free OCR engine from HP/Google, to handle the PDF OCR conversion.
Tesseract OCR
The ScanSnap ships with a free version of Abbyy FineReader, an excellent piece of OCR software; unfortunately, any scripting ability requires a pricey upgrade to the Pro version. A short search later, I found the most popular free, open-source solution out there: Tesseract-OCR. I had looked at it a while ago, when the text-recognition quality seemed lacking, but version 3.x is significantly better.
However, simply downloading Tesseract and running it doesn't give you a very usable solution, as I found out to my frustration. The software only takes image files (like TIFF or JPG) as input, and produces either a text file or an hOCR HTML file as output. Even a web search did not turn up any ready-built scripts that have Tesseract take a PDF as input and output the OCR'ed PDF. So, with the help of an hOCR-to-PDF script I found through Google, I wrote my own script called PyPDFOCR.
Usage
Once you have PyPDFOCR installed, it's as simple as typing:
python pypdfocr.py filename.pdf
This will generate a corresponding filename_ocr.pdf alongside the original.
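If you want to run this over a whole folder of scans, the naming convention above is easy to script around. Here's a minimal sketch; the helper names (ocr_output_path, pending_pdfs, pypdfocr_cmd) are hypothetical and not part of PyPDFOCR, and it assumes the output keeps the input's extension and that pypdfocr.py sits in the current directory.

```python
import os
import sys


def ocr_output_path(pdf_path):
    """Mirror the naming shown above: filename.pdf -> filename_ocr.pdf.

    Assumption: the extension is kept exactly as it appears in the input.
    """
    root, ext = os.path.splitext(pdf_path)
    return root + "_ocr" + ext


def pending_pdfs(filenames):
    """Return the PDFs that do not yet have an _ocr counterpart."""
    names = set(filenames)
    pending = []
    for name in filenames:
        root, ext = os.path.splitext(name)
        if ext.lower() != ".pdf":
            continue  # not a PDF
        if root.endswith("_ocr"):
            continue  # already an OCR output
        if ocr_output_path(name) in names:
            continue  # output already exists
        pending.append(name)
    return pending


def pypdfocr_cmd(pdf_path, script="pypdfocr.py"):
    """Build the command line from the Usage section for one file."""
    return [sys.executable, script, pdf_path]


if __name__ == "__main__":
    for pdf in pending_pdfs(sorted(os.listdir("."))):
        # Print rather than execute, so the sketch is safe to try anywhere.
        print(" ".join(pypdfocr_cmd(pdf)))
```

Swapping the print for subprocess.check_call(pypdfocr_cmd(pdf)) would actually run the conversion on each file.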
Please see the documentation for all the features.
Installation
This script does have a bunch of external dependencies (all free/open source). So far, I've verified it runs on Mac OS X (10.7; probably other versions too) and Windows 7 64-bit.
Please see the documentation for installation instructions.
How Scripting Tesseract for PDF-to-PDF Conversion Works
Here's how this script works:
- Using Ghostscript, convert the input PDF into a multi-page TIFF image
- Using Ghostscript, convert the input PDF into multiple JPEG images (one per page). This is required to work around a compression issue in the ReportLab PDF generation.
- Using Tesseract, convert the multi-page TIFF into an OCR representation called hOCR (an HTML-based open standard that describes the location of every recognized word on a page)
- Build the output PDF from the JPEG images, while parsing the hOCR file and drawing the recognized text on each page in an invisible font
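To make the first three steps concrete, here is a rough sketch of the external commands involved. It is not the actual PyPDFOCR code: the function names, the 300 dpi resolution, the tiffg4 device choice, and the intermediate file names are all assumptions made for illustration. The Ghostscript and Tesseract options themselves are standard ones.

```python
import subprocess


def ghostscript_cmd(pdf_path, out_pattern, device, dpi=300):
    """Build a Ghostscript command that rasterizes a PDF.

    Assumption: the executable is called "gs" (on Windows it is usually
    gswin32c or gswin64c).
    """
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=%s" % device,
        "-r%d" % dpi,
        "-sOutputFile=%s" % out_pattern,
        pdf_path,
    ]


def tesseract_hocr_cmd(image_path, out_base, lang="eng"):
    """Build a Tesseract command that writes hOCR instead of plain text.

    The trailing "hocr" is the name of a Tesseract config file. Depending
    on the Tesseract version, the result is out_base.html or out_base.hocr.
    """
    return ["tesseract", image_path, out_base, "-l", lang, "hocr"]


def build_pipeline(pdf_path, work_base, dpi=300):
    """Return the commands for steps 1 to 3, in order."""
    tiff_path = work_base + ".tiff"
    return [
        # Step 1: one multi-page TIFF for Tesseract (tiffg4 is an assumed device).
        ghostscript_cmd(pdf_path, tiff_path, "tiffg4", dpi),
        # Step 2: one JPEG per page (%d is replaced by the page number).
        ghostscript_cmd(pdf_path, work_base + "_%d.jpg", "jpeg", dpi),
        # Step 3: OCR the TIFF into an hOCR file.
        tesseract_hocr_cmd(tiff_path, work_base),
    ]


def run_pipeline(commands):
    """Run each command in turn, stopping at the first failure."""
    for cmd in commands:
        subprocess.check_call(cmd)


if __name__ == "__main__":
    for cmd in build_pipeline("filename.pdf", "filename_tmp"):
        print(" ".join(cmd))
```

Building the commands as lists and running them separately keeps the sketch easy to inspect without Ghostscript or Tesseract installed; calling run_pipeline on the result would execute them for real.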
Special thanks to the folks at Google who wrote hocr-pdf.py (Apache License 2.0), which showed me how to use the hOCR format; I basically only had to add multi-page support for this part of the flow.
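For a feel of what working with hOCR involves, the sketch below pulls each recognized word and its bounding box out of an hOCR document and converts the box to PDF coordinates. It uses only the Python standard library and is not taken from hocr-pdf.py or PyPDFOCR; the class and function names are made up for illustration. It relies on two things from the hOCR format: words are span elements with class ocrx_word, and the title attribute carries "bbox x0 y0 x1 y1" in pixels measured from the top-left corner.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def bbox_to_pdf_origin(bbox, dpi, page_height_px):
    """Convert a pixel bbox to the PDF position of its lower-left corner.

    PDF units are points (72 per inch) with the origin at the bottom-left,
    while hOCR measures pixels from the top-left, so y has to be flipped.
    """
    x0, _, _, y1 = bbox
    scale = 72.0 / dpi
    return (x0 * scale, (page_height_px - y1) * scale)


if __name__ == "__main__":
    sample = (
        '<span class="ocr_line" title="bbox 0 0 100 50">'
        '<span class="ocrx_word" title="bbox 10 20 30 40">Hello</span> '
        '<span class="ocrx_word" title="bbox 35 20 60 40">world</span>'
        '</span>'
    )
    parser = HocrWordParser()
    parser.feed(sample)
    for text, bbox in parser.words:
        print(text, bbox, bbox_to_pdf_origin(bbox, 300, 3300))
```

A PDF library such as ReportLab can then draw each word at the converted position in an invisible text mode on top of the page image, which is what makes the output searchable.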