Workflow & Filetype for a Cookbook collection
Posted: 15 Apr 2015, 09:05
After following the boards and doing about six months of research, I have just finished assembling an Archivist, installed CHDK on two cameras and installed SpreadPi on a Raspberry Pi 2. It actually went faster than I expected, so I'm now unexpectedly at the software stage.
My first project with the scanner is to digitize a cookbook collection and run OCR to make the books searchable. Some of the older books are text only, but the newer ones often mix images and text. Some of the larger ones are 500+ pages (a couple are even 1000+ pages). Like many cookbooks, each recipe has an intro, then a block of ingredients, followed by the recipe steps. The newer ones also have inset boxes or margin printing. That is to say, the text layout can be irregular.
I'd like to draw on the community experience to start with an appropriate workflow and filetype for this project.
1) I think my first fork is whether to use the PDF or DjVu file format. From what I've read, I can see pros and cons to each. PDF seems to give more universal and searchable OCR, but the files can get large and take a long time to deal with. DjVu seems to be great for storing this type of book efficiently, but may not be as good with OCR (?) and requires special software to open. What would some of you recommend as the right choice for a searchable cookbook database?
2) Once I've settled that, I'm interested in workflow recommendations. I have an older copy of Adobe Acrobat X on an academic license, but the just-released Acrobat DC is pretty accessible with the new subscription model ($14/mo vs. a $500 outlay...), so ClearScan becomes an option. Tesseract-based OCR seems to be doing great these days too, though. A workflow of Spreads -> ScanTailor -> Acrobat seemed to be the simplest, but I'm up for a more complicated workflow if it gives better results.
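For comparison, here is roughly what I imagine the Tesseract route would look like in place of the Acrobat step. This is only a sketch under assumptions: it assumes ScanTailor has written one cleaned TIFF per page into a folder, that the `tesseract` command-line tool is installed with its `pdf` output option, and that the books are in English. The function name and folder names are made up for illustration. It only builds the per-page commands; it does not run them or merge the resulting one-page PDFs.

```python
from pathlib import Path


def tesseract_commands(page_dir, out_dir, lang="eng"):
    """Build one `tesseract <image> <outputbase> -l <lang> pdf` command per page.

    page_dir: folder of per-page TIFFs (assumed to come from ScanTailor).
    out_dir:  folder where the one-page searchable PDFs would be written.
    lang:     Tesseract language code (assumption: English cookbooks).
    """
    # Sort by file name so pages are processed in reading order.
    pages = sorted(Path(page_dir).glob("*.tif"))
    return [
        # Tesseract appends ".pdf" to the output base name itself.
        ["tesseract", str(page), str(Path(out_dir) / page.stem), "-l", lang, "pdf"]
        for page in pages
    ]
```

Each command could then be run with `subprocess.run(cmd, check=True)`, and the single-page PDFs joined into one book afterwards with whatever PDF tool is at hand.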
So.... thoughts and guidance? Thank you!