Posted by virantha on Sat 20 April 2013

Python auto sort of OCR'ed PDFs

I'd previously written about how I was using a Fujitsu ScanSnap 1500 to reduce paper clutter and move to a paperless workflow at home.  So far, this system has been working great for me, with every scanned document getting OCR'ed and uploaded to my default Evernote notebook as a searchable PDF.  However, I realized that I was still spending more time than I wanted to in manually sorting these documents into Evernote notebooks.

It seemed silly that there was no way for Evernote to automatically sort PDFs based on some kind of tag or keyword search.  So I ended up spending a few hours developing the Python script described below to do this sorting automatically.  At some point, I'll package this up on github or bitbucket.

I'm calling this program scanever because I didn't have more than 30 seconds to come up with something besides pdf2evernote.  It replaces the kludgy batch file and watch4folder I had used previously, and is OS-independent.

Use

scanever takes a configuration file that lists all the folders I want to sort into, as well as the keywords that fall into each folder.  Here's an example (it uses the YAML syntax):

watch_folder: "M:/Incoming"
evernote_folder: "M:/To Evernote"
default_folder: default
folders:
   finances:
     - chase visa
     - american express
     - bank of america
   home:
     - mortgage
     - property tax
     - city of oz
   health:
     - explanation of benefits
     - healthcare

The folders section defines a list of folders and their associated keywords. For example, any PDF that gets placed in "M:/Incoming" by Abbyy FineReader that has "chase visa" or "american express" or "bank of america" in its first page gets filed into a subfolder of "M:/To Evernote" called "finances". The default_folder sets where the pdf goes if there is no match found.
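
To make the matching rule concrete, here is a minimal sketch that uses a plain dict in place of the parsed YAML. The name pick_folder is hypothetical and not part of scanever, and the sketch assumes case-insensitive substring matching against the first page's text, falling back to the default folder when nothing matches.

```python
# Hypothetical stand-in for the parsed YAML configuration shown above.
CONFIG = {
    "default_folder": "default",
    "folders": {
        "finances": ["chase visa", "american express", "bank of america"],
        "home": ["mortgage", "property tax", "city of oz"],
        "health": ["explanation of benefits", "healthcare"],
    },
}


def pick_folder(first_page_text, config):
    """Return the first folder with a keyword found in the text, else the default."""
    haystack = first_page_text.lower()
    for folder, keywords in config["folders"].items():
        for keyword in keywords:
            if keyword.lower() in haystack:
                return folder
    return config["default_folder"]


print(pick_folder("Your CHASE VISA statement is ready", CONFIG))  # finances
print(pick_folder("Grocery receipt", CONFIG))                     # default
```

Note that when a page contains keywords from more than one folder, the result depends on the order in which the folders are iterated, so keywords are best kept specific to one folder.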

So how does my script work? The main logic is divided into three classes:

  • Main class that reads in the configuration file and runs everything
  • Folder watching logic
  • File matching and filing logic

Watching the incoming folder

For this, I'm relying on the excellent cross-platform package watchdog. My custom event handler class, shown below, interfaces with its API and works as follows:

  1. On any file created/modified/moved event, it checks whether the filename ends with .pdf but not _OCR.pdf. (This is the first PDF that the ScanSnap writes, before it hands the file off to Abbyy for OCR'ing.) We add this filename to the events list (inside check_for_new_pdf) if it is not already present.
  2. Abbyy FineReader at some point will finish writing the _OCR.pdf file and delete the original .pdf. We key off this delete event and start our analysis of the final _OCR file, described in the next section.
import os

from watchdog.events import FileSystemEventHandler


class ChangeHandler(FileSystemEventHandler):
    # Raw (pre-OCR) PDFs seen so far; shared across all handler instances
    events = []

    def __init__(self, pdfsearcher):
        FileSystemEventHandler.__init__(self)
        self.pdfsearcher = pdfsearcher

    def check_for_new_pdf(self, ev_path):
        # Queue raw scans only; the _OCR.pdf output is handled in on_deleted
        if ev_path.endswith(".pdf"):
            if not ev_path.endswith("_OCR.pdf"):
                if ev_path not in ChangeHandler.events:
                    ChangeHandler.events.append(ev_path)
                    print("Adding %s to event queue" % ev_path)
                else:
                    print("%s already in event queue" % ev_path)

    def on_created(self, event):
        print("on_created: %s" % event.src_path)
        self.check_for_new_pdf(event.src_path)

    def on_moved(self, event):
        print("on_moved: %s" % event.src_path)
        # A moved file is tracked under its new name
        self.check_for_new_pdf(event.dest_path)

    def on_modified(self, event):
        print("on_modified: %s" % event.src_path)
        self.check_for_new_pdf(event.src_path)

    def on_deleted(self, event):
        print("on_deleted: %s" % event.src_path)
        ev_src_path = event.src_path
        if ev_src_path in ChangeHandler.events:
            print("Deleting %s in event queue" % ev_src_path)
            ChangeHandler.events.remove(ev_src_path)
            # Now, check that the OCR version is present
            base, ext = os.path.splitext(ev_src_path)
            ocr_path = base + "_OCR" + ext
            if os.path.exists(ocr_path):
                print("Analyzing OCR'ed file %s!" % ocr_path)
                pdf = self.pdfsearcher
                text = pdf.readPdfFirstPage(ocr_path)
                pdf.moveToFolders()
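
The queue-then-delete handshake in the handler above can be tried out without a scanner or watchdog. The following is only a sketch under stated assumptions: PendingPdfTracker and its injectable exists callable are hypothetical names introduced for illustration, and the events are simulated by calling the methods directly.

```python
import os


class PendingPdfTracker(object):
    """Tracks raw scans and reports when their OCR'ed replacement is ready."""

    def __init__(self, exists=os.path.exists):
        self.pending = []
        self.exists = exists  # injectable so the sketch runs without real files

    def seen(self, path):
        """Call on created/modified/moved events; queues raw (non-OCR) PDFs once."""
        if path.endswith(".pdf") and not path.endswith("_OCR.pdf"):
            if path not in self.pending:
                self.pending.append(path)

    def deleted(self, path):
        """Call on deleted events; returns the OCR path to analyze, or None."""
        if path not in self.pending:
            return None
        self.pending.remove(path)
        base, ext = os.path.splitext(path)
        ocr_path = base + "_OCR" + ext
        return ocr_path if self.exists(ocr_path) else None


# Simulate one scan: the raw PDF appears, the OCR copy is written,
# and then the raw PDF is deleted.
on_disk = set(["scan1_OCR.pdf"])
tracker = PendingPdfTracker(exists=lambda p: p in on_disk)
tracker.seen("scan1.pdf")
tracker.seen("scan1.pdf")        # duplicate event, ignored
tracker.seen("scan1_OCR.pdf")    # OCR output is never queued
print(tracker.pending)
print(tracker.deleted("scan1.pdf"))
```

Keeping the file-existence check injectable is a design choice for the sketch only; it lets the logic be tested with a set of fake paths instead of real files.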

Analyzing the PDF and sorting it

The next step is to read in the PDF (the first page) and search for matching text. For this, I rely on PyPDF2:

import logging
import os
import shutil

from PyPDF2 import PdfFileReader


class PdfSearcher(object):
    def __init__(self, evernote, default):
        self.folderTargets = {}
        self.pdfText = ""
        self.evernote_folder = evernote
        self.default_folder = default

    def readPdfFirstPage(self, filename):
        self.filename = filename
        # Close the file before returning so it can be moved afterwards
        with open(filename, "rb") as stream:
            reader = PdfFileReader(stream)
            text = reader.getPage(0).extractText()
        # Drop non-ASCII characters but keep a text string for matching
        text = text.encode('ascii', 'ignore').decode('ascii')
        self.pdfText = text
        return text

    def addFolderTarget(self, dirname, matchStrings):
        # Used externally to add in the keywords/folders
        assert dirname not in self.folderTargets, "Target folder already defined! (%s)" % (dirname)
        self.folderTargets[dirname] = matchStrings

    def _getMatchingFolder(self):
        # Return the folder that matches any of the keywords in self.pdfText
        searchText = self.pdfText.lower()
        for folder, strings in self.folderTargets.items():
            for s in strings:
                # Compare in lowercase so keywords match regardless of case
                if s.lower() in searchText:
                    print(s)
                    return folder
        # No match found, so return
        return None

    def moveToFolders(self):
        tgt_folder = self._getMatchingFolder()
        if not tgt_folder:
            logging.debug("No match found, using default folder")
            tgt_path = os.path.join(self.evernote_folder, self.default_folder)
        else:
            tgt_path = os.path.join(self.evernote_folder, tgt_folder)
        if not os.path.exists(tgt_path):
            os.makedirs(tgt_path)
            logging.debug("Making path %s" % tgt_path)
        shutil.move(self.filename, tgt_path)
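
The filing step can be tried in isolation in a throwaway directory. This is a rough sketch rather than part of scanever: file_into is a hypothetical helper that mirrors the create-folder-then-move logic above, and the placeholder file only stands in for a real OCR'ed PDF.

```python
import os
import shutil
import tempfile


def file_into(evernote_folder, subfolder, pdf_path):
    """Move pdf_path into evernote_folder/subfolder, creating it if needed."""
    target_dir = os.path.join(evernote_folder, subfolder)
    if not os.path.exists(target_dir):
        os.makedirs(target_dir)
    shutil.move(pdf_path, target_dir)
    return os.path.join(target_dir, os.path.basename(pdf_path))


# Try it in a temporary directory so nothing real is touched.
root = tempfile.mkdtemp()
incoming = os.path.join(root, "Incoming")
os.makedirs(incoming)
scan = os.path.join(incoming, "statement_OCR.pdf")
with open(scan, "wb") as f:
    f.write(b"%PDF-1.4 placeholder")  # not a valid PDF, just a stand-in

new_path = file_into(os.path.join(root, "To Evernote"), "finances", scan)
print("moved to: %s" % new_path)
print("still in Incoming: %s" % os.path.exists(scan))
shutil.rmtree(root)
```

One thing to keep in mind: shutil.move raises an error if a file with the same name already exists in the target folder, so repeated scan names would need to be handled separately.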

Main class

Now, here's the main class that reads the configuration and sets up the event loop:

import logging
import os
import sys
import time
from optparse import OptionParser

import yaml
from watchdog.observers import Observer


class ScanEver(object):

    def __init__(self):
        self.maxlength = 500

    def getOptions(self, argv):
        usage = 'ScanEver '
        p = OptionParser(usage)

        p.add_option('-d', '--debug', action='store_true',
            default=False, dest='debug', help='Turn on debugging')

        p.add_option('-v', '--verbose', action='store_true',
            default=False, dest='verbose', help='Turn on verbose mode')

        (opt, args) = p.parse_args(argv)

        self.debug = opt.debug
        self.verbose = opt.verbose

        # basicConfig only takes effect on its first call, so pick the level once
        if opt.debug:
            logging.basicConfig(level=logging.DEBUG, format='%(message)s')
        elif opt.verbose:
            logging.basicConfig(level=logging.INFO, format='%(message)s')

        # safe_load parses plain YAML without constructing arbitrary objects
        with open("paths.yaml", "r") as fstream:
            myopts = yaml.safe_load(fstream)
        self.watch_folder = myopts['watch_folder']
        self.evernote_folder = myopts['evernote_folder']
        self.default_folder = myopts['default_folder']
        self.searcher = PdfSearcher(self.evernote_folder, self.default_folder)

        for folder, strings in myopts["folders"].items():
            self.searcher.addFolderTarget(folder, strings)

    def monitor(self):
        event_handler = ChangeHandler(self.searcher)
        observer = Observer()
        # Watch the folder named in the configuration file
        observer.schedule(event_handler, os.path.abspath(self.watch_folder))
        observer.start()
        try:
            while True:
                time.sleep(1)
        except KeyboardInterrupt:
            # Ctrl-C stops the observer and lets the script exit
            observer.stop()
        observer.join()

    def go(self, argv):
        # Read the command line options
        self.getOptions(argv)
        self.monitor()

if __name__ == '__main__':
    script = ScanEver()
    script.go(sys.argv[1:])

© Virantha Ekanayake. Built using Pelican. Modified svbhack theme, based on theme by Carey Metcalfe