As a responsible human being, I've amassed quite a large number of important paper documents, ranging from receipts and bills to contracts and passports. Naturally, the longer I live, the larger this hoard becomes. Documents are the last thing I want to hoard, so I set out to find a way to digitize them. That way I can leverage OCR and offsite backups to keep my documents readable, organized, searchable, and redundantly backed up.
Easily the two top choices are Mayan EDMS and The Paperless Project. They are opposites in most respects, though they share a couple of features:
- Mayan EDMS runs on Python 2 in 2019 (until release 3.3 is out), while Paperless runs on Python 3
- Mayan EDMS is constantly being worked on and new releases are being cut, while Paperless is still run by (mostly) one guy and hasn't had any commits to master in ~6 months
- Paperless is geared more towards single-user environments and homelabbers, while Mayan EDMS is a fully fledged document management system used by companies and enterprises
- Paperless does document transformations/cleanup for you in a predefined way, while Mayan EDMS lets you fully customize your document transformations
- Both support importing document scans via a multitude of ways: monitoring a folder, NFS/SMB share, email
- Both use Tesseract for OCR
Result: Mayan EDMS
In the end I decided that a well-maintained and actively developed product was worth more than one that is better suited to my use case but semi-abandoned. Mayan EDMS has a ridiculous number of configuration options, and it's definitely overwhelming at first. After a while of researching and reading the documentation, you get to a point where you're semi-comfortable with it. I've found that any questions I had could easily be asked on the Mayan EDMS GitLab page, and they were usually answered within a day.
Once I became semi-familiar with Mayan EDMS, I drafted a basic workflow/configuration:
- I configured 4 document types: bills, receipts, documents, and uploads
- I set up a staging folder for document uploads and pointed it at the folder where all my document scans are
- I “upload” a new document by choosing a document type and selecting a scan that matches the selected type
- Once “uploaded”, the file is submitted for OCR scanning and is added to my database
- The file is then deleted from the staging folder
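
To make the flow above concrete, here is a minimal sketch of the same steps in plain Python. This is not Mayan EDMS's API: the `ingest` helper, the `archive_dir` layout, and the `ocr` callable are all hypothetical stand-ins, used only to illustrate the order of operations (pick a type, store the file, run OCR, then remove it from staging).

```python
import shutil
from pathlib import Path

# The four document types from the workflow above.
DOCUMENT_TYPES = {"bills", "receipts", "documents", "uploads"}


def ingest(staging_dir, archive_dir, filename, doc_type, ocr=lambda path: ""):
    """Hypothetical helper: file one staged scan under a document type.

    `ocr` is a placeholder callable standing in for the OCR step;
    it is an assumption for illustration, not Mayan EDMS's OCR backend.
    """
    if doc_type not in DOCUMENT_TYPES:
        raise ValueError(f"unknown document type: {doc_type!r}")

    source = Path(staging_dir) / filename
    if not source.is_file():
        raise FileNotFoundError(source)

    # Assumed layout: one sub-folder per document type.
    target_dir = Path(archive_dir) / doc_type
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename

    # Copy first, run OCR, and only then delete from staging,
    # so a failure along the way never loses the original scan.
    shutil.copy2(source, target)
    text = ocr(target)
    source.unlink()

    return {"path": target, "type": doc_type, "text": text}
```

Calling `ingest("scans", "archive", "water-bill.pdf", "bills")` would, under these assumptions, leave the file in `archive/bills/` and remove it from `scans/`. Deleting from staging last mirrors the final step of the list: the scan only disappears once it is safely stored.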
In the end I had roughly 1200 documents, and sorting through and categorizing them was painful. The OCR results could definitely be better, but that's mostly because the documents I scanned weren't optimal for OCR. Overall I am satisfied with the workflow, and I will be improving on it in the future. I'm excited for the Python 3 release of Mayan EDMS, which will definitely check a box on my list.