I previously wrote about how I use a Fujitsu ScanSnap 1500 to reduce paper clutter and move to a paperless workflow at home. So far, this system has worked great for me: every scanned document gets OCR'ed and uploaded to my default Evernote notebook as a searchable PDF. However, I realized that I was still spending more time than I wanted manually sorting these documents into Evernote notebooks.
It seemed silly that there was no way for Evernote to automatically sort PDFs based on some kind of tag or keyword search. So I ended up spending a few hours developing the Python script described below to do this sorting automatically. At some point, I'll package this up on GitHub or Bitbucket.
I'm calling this program scanever because I didn't have more than 30 seconds to come up with something besides pdf2evernote. It replaces the kludgy batch file and watch4folder setup I had used previously, and it is OS-independent.
Use
scanever takes a configuration file that lists all the folders I want to sort into, as well as the keywords that map to each folder. Here's an example (it uses YAML syntax):
watch_folder: "M:/Incoming"
evernote_folder: "M:/To Evernote"
default_folder: default
folders:
    finances:
        - chase visa
        - american express
        - bank of america
    home:
        - mortgage
        - property tax
        - city of oz
    health:
        - explanation of benefits
        - healthcare
The folders section defines a list of folders and their associated keywords. For example, any PDF that Abbyy FineReader places in "M:/Incoming" and that has "chase visa", "american express", or "bank of america" on its first page gets filed into a subfolder of "M:/To Evernote" called "finances". The default_folder setting determines where the PDF goes if no match is found.
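To make the matching rule concrete, here is a minimal sketch that mirrors the configuration above as a plain Python dict and picks a folder for some first-page text. The name pick_folder and the sample texts are hypothetical, made up for illustration; the script's own matching lives in the PdfSearcher class shown later.

```python
# Hypothetical stand-in for the parsed configuration shown above.
CONFIG = {
    "default_folder": "default",
    "folders": {
        "finances": ["chase visa", "american express", "bank of america"],
        "home": ["mortgage", "property tax", "city of oz"],
        "health": ["explanation of benefits", "healthcare"],
    },
}


def pick_folder(first_page_text, config=CONFIG):
    """Return the first folder with a keyword in the text, else the default."""
    haystack = first_page_text.lower()  # keywords are written in lowercase
    for folder, keywords in config["folders"].items():
        if any(keyword in haystack for keyword in keywords):
            return folder
    return config["default_folder"]


print(pick_folder("CHASE VISA statement for March"))
print(pick_folder("Grocery receipt"))
```

A statement mentioning Chase Visa lands in finances, while text with no keyword at all falls back to the default folder.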
So how does my script work? The main logic is divided into three classes:
- Main class that reads in the configuration file and runs everything
- Folder watching logic
- File matching and filing logic
Watching the incoming folder
For this, I'm relying on the excellent cross-platform package watchdog. Here's the custom event handler class I use to interface with its API. It works as follows:
- On any file created/modified/moved event, it checks whether the file name ends with .pdf but not with _OCR.pdf. (This is the initial PDF that the ScanSnap writes before sending it off to Abbyy for OCR'ing.) We add this filename to the events list (inside check_for_new_pdf) if it is not already present.
- Abbyy FineReader at some point will finish writing the _OCR.pdf file and delete the original .pdf. We key off this delete event and start our analysis of the final _OCR file, described in the next section.
import os

from watchdog.events import FileSystemEventHandler


class ChangeHandler(FileSystemEventHandler):
    # Paths of freshly scanned PDFs that are still waiting on OCR
    events = []

    def __init__(self, pdfsearcher):
        FileSystemEventHandler.__init__(self)
        self.pdfsearcher = pdfsearcher

    def check_for_new_pdf(self, ev_path):
        # Queue the original scan; ignore the _OCR.pdf that Abbyy writes
        if ev_path.endswith(".pdf"):
            if not ev_path.endswith("_OCR.pdf"):
                if ev_path not in ChangeHandler.events:
                    ChangeHandler.events.append(ev_path)
                    print("Adding %s to event queue" % ev_path)
                else:
                    print("%s already in event queue" % ev_path)

    def on_created(self, event):
        print("on_created: %s" % event.src_path)
        self.check_for_new_pdf(event.src_path)

    def on_moved(self, event):
        print("on_moved: %s" % event.src_path)
        self.check_for_new_pdf(event.dest_path)

    def on_modified(self, event):
        print("on_modified: %s" % event.src_path)
        self.check_for_new_pdf(event.src_path)

    def on_deleted(self, event):
        print("on_deleted: %s" % event.src_path)
        ev_src_path = event.src_path
        if ev_src_path in ChangeHandler.events:
            print("Deleting %s in event queue" % ev_src_path)
            ChangeHandler.events.remove(ev_src_path)
            # Now, check that the OCR version is present
            ocr_path = ev_src_path.replace(".pdf", "_OCR.pdf")
            if os.path.exists(ocr_path):
                print("Analyzing OCR'ed file %s!" % ocr_path)
                pdf = self.pdfsearcher
                text = pdf.readPdfFirstPage(ocr_path)
                pdf.moveToFolders()
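The two-step handshake in the handler (queue the original scan, then act once it is deleted) can be tried without watchdog or a scanner. The sketch below is a simplified stand-in, not the handler itself: the PendingScans class and its method names are hypothetical, and it builds the _OCR.pdf name with os.path.splitext instead of str.replace so that only the extension is touched.

```python
import os


class PendingScans(object):
    """Hypothetical, watchdog-free model of the handler's event queue."""

    def __init__(self):
        self.pending = []

    def saw_file(self, path):
        """Queue an original scan; return True only if it was newly added."""
        if not path.endswith(".pdf") or path.endswith("_OCR.pdf"):
            return False
        if path in self.pending:
            return False
        self.pending.append(path)
        return True

    def saw_delete(self, path):
        """Return the OCR file name to analyze, or None if path was not queued."""
        if path not in self.pending:
            return None
        self.pending.remove(path)
        root, ext = os.path.splitext(path)
        return root + "_OCR" + ext


queue = PendingScans()
queue.saw_file("scan001.pdf")         # the ScanSnap writes the original
queue.saw_file("scan001_OCR.pdf")     # Abbyy's output is ignored here
print(queue.saw_delete("scan001.pdf"))
```

Unlike the real handler, this model does not check that the OCR file actually exists on disk before reporting it.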
Analyzing the PDF and sorting it
The next step is to read in the first page of the PDF and search it for matching text. For this, I rely on PyPDF2:
import logging
import os
import shutil

from PyPDF2 import PdfFileReader


class PdfSearcher(object):
    def __init__(self, evernote, default):
        self.folderTargets = {}
        self.pdfText = ""
        self.evernote_folder = evernote
        self.default_folder = default

    def readPdfFirstPage(self, filename):
        self.filename = filename
        reader = PdfFileReader(filename)
        text = reader.getPage(0).extractText()
        # Drop non-ASCII characters; decode again so the result stays text
        text = text.encode('ascii', 'ignore').decode('ascii')
        self.pdfText = text
        return text

    def addFolderTarget(self, dirname, matchStrings):
        # Used externally to add in the keywords/folders
        assert dirname not in self.folderTargets, "Target folder already defined! (%s)" % (dirname)
        self.folderTargets[dirname] = matchStrings

    def _getMatchingFolder(self):
        # Return the folder that matches any of the keywords in self.pdfText
        searchText = self.pdfText.lower()
        for folder, strings in self.folderTargets.items():
            for s in strings:
                if s in searchText:
                    print(s)
                    return folder
        # No match found, so return
        return None

    def moveToFolders(self):
        tgt_folder = self._getMatchingFolder()
        if not tgt_folder:
            logging.debug("No match found, using default folder")
            tgt_path = os.path.join(self.evernote_folder, self.default_folder)
        else:
            tgt_path = os.path.join(self.evernote_folder, tgt_folder)
        if not os.path.exists(tgt_path):
            os.makedirs(tgt_path)
            logging.debug("Making path %s" % tgt_path)
        shutil.move(self.filename, tgt_path)
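To see the filing step in isolation, the sketch below performs the same create-folder-then-move sequence inside a temporary directory. The helper name file_into and the file names are hypothetical; it uses only the standard library and does not touch PyPDF2 or Evernote.

```python
import os
import shutil
import tempfile


def file_into(src_path, root_folder, subfolder):
    """Move src_path into root_folder/subfolder, creating it if needed."""
    target_dir = os.path.join(root_folder, subfolder)
    if not os.path.exists(target_dir):
        os.makedirs(target_dir)
    destination = os.path.join(target_dir, os.path.basename(src_path))
    shutil.move(src_path, target_dir)
    return destination


workdir = tempfile.mkdtemp()
scan = os.path.join(workdir, "scan001_OCR.pdf")
with open(scan, "wb") as handle:
    handle.write(b"placeholder")  # not a real PDF, just a file to move

moved_to = file_into(scan, os.path.join(workdir, "To Evernote"), "finances")
print(os.path.relpath(moved_to, workdir))
shutil.rmtree(workdir)  # clean up the temporary directory
```

Because the target is an existing directory, shutil.move keeps the original file name and places the file inside it, which is the behavior moveToFolders relies on.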
Main class
Now, here's the main class that reads the configuration and sets up the event loop:
import logging
import os
import sys
import time
from optparse import OptionParser

import yaml
from watchdog.observers import Observer


class ScanEver(object):
    def __init__(self):
        self.maxlength = 500

    def getOptions(self, argv):
        usage = 'ScanEver '
        p = OptionParser(usage)
        p.add_option('-d', '--debug', action='store_true',
                     default=False, dest='debug', help='Turn on debugging')
        p.add_option('-v', '--verbose', action='store_true',
                     default=False, dest='verbose', help='Turn on verbose mode')
        (opt, args) = p.parse_args(argv)

        self.debug = opt.debug
        self.verbose = opt.verbose
        if opt.debug:
            logging.basicConfig(level=logging.DEBUG, format='%(message)s')
        if opt.verbose:
            logging.basicConfig(level=logging.INFO, format='%(message)s')

        # Read the folder/keyword configuration
        with open("paths.yaml", "r") as fstream:
            myopts = yaml.safe_load(fstream)
        self.watch_folder = myopts['watch_folder']
        self.evernote_folder = myopts['evernote_folder']
        self.default_folder = myopts['default_folder']

        self.searcher = PdfSearcher(self.evernote_folder, self.default_folder)
        for folder, strings in myopts["folders"].items():
            self.searcher.addFolderTarget(folder, strings)

    def monitor(self):
        event_handler = ChangeHandler(self.searcher)
        observer = Observer()
        # Watch the incoming folder named in the configuration file
        observer.schedule(event_handler, os.path.abspath(self.watch_folder))
        observer.start()
        try:
            # Idle until Ctrl-C; the observer runs in its own thread
            while True:
                time.sleep(1)
        except KeyboardInterrupt:
            observer.stop()
        observer.join()

    def go(self, argv):
        # Read the command line options
        self.getOptions(argv)
        self.monitor()


if __name__ == '__main__':
    script = ScanEver()
    script.go(sys.argv[1:])
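
As an aside, the standard library's argparse module can express the same two command-line flags. The sketch below is one possible translation rather than part of scanever: parse_flags and log_level are hypothetical names, and falling back to WARNING when neither flag is given is an assumption based on logging's default level.

```python
import argparse
import logging


def parse_flags(argv):
    """Parse the -d/--debug and -v/--verbose switches from argv."""
    parser = argparse.ArgumentParser(prog="scanever")
    parser.add_argument("-d", "--debug", action="store_true",
                        help="Turn on debugging")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Turn on verbose mode")
    return parser.parse_args(argv)


def log_level(opts):
    """Debug wins over verbose; with neither, keep logging's default level."""
    if opts.debug:
        return logging.DEBUG
    if opts.verbose:
        return logging.INFO
    return logging.WARNING


opts = parse_flags(["-v"])
logging.basicConfig(level=log_level(opts), format="%(message)s")
logging.info("verbose mode is on")
```

Choosing the level in one place avoids calling logging.basicConfig twice, since a second call has no effect once the root logger is configured.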