
OCR Page Namer


Re: OCR Page Namer

Post by univurshul » 14 Nov 2010, 12:34

dansheffler wrote:I use File Wrangler to rename my pages.
dansheffler, does File Wrangler let you use OS X's Quick Look feature to scroll through your finished, compiled order set and visually check for any ordering errors?


Re: OCR Page Namer

Post by dansheffler » 14 Nov 2010, 13:32

If it does, I haven't been able to figure it out. I just keep a Finder window open alongside it to do just that.


Re: OCR Page Namer

Post by Anonymous1 » 15 Nov 2010, 14:18

For those who don't use Mac OS, use Métamorphose. I use it a lot for batch renaming, and it is really good at what it does.

It's written in Python, works with Linux and Windows, and is completely free and open source. It supports Regex, file attributes, and other metadata for music files and pictures. I really love it. Here are some screenshots (they're big, so they will be links):
http://file-folder-ren.sourceforge.net/ ... _large.png
http://file-folder-ren.sourceforge.net/ ... _large.png
Last edited by Anonymous1 on 16 Nov 2010, 01:01, edited 1 time in total.


Re: OCR Page Namer

Post by Anonymous1 » 15 Nov 2010, 14:22

If anyone's interested, I'm almost done with the GUI version of the OCR page naming program.

Currently, these features have been implemented:
  • Drag-and-drop file selector
  • File previews
  • Normal file selector
  • File conversion (Tesseract only reads TIFF images. The software will prompt you to convert the file(s). If you don't convert, it removes the file from the queue).
  • Manual offset input
The next set of features will be:
  • Scan Tailor-like mask selector for the page numbers.
  • Left and right pages
  • ???
I'll post screenshots later.
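
For readers who want to try the conversion step themselves, here is a minimal sketch of splitting a queue into TIFF and non-TIFF files and building an ImageMagick command for the rest. This is not the program's actual code; the function names `split_queue` and `convert_command` are made up for illustration, and it assumes ImageMagick's `convert` is on your PATH.

```python
from pathlib import Path

# Extensions treated as TIFF (an assumption; adjust to your own files).
TIFF_SUFFIXES = {".tif", ".tiff"}


def split_queue(paths):
    """Split a list of file paths into (tiffs, others) by extension."""
    tiffs, others = [], []
    for p in map(Path, paths):
        if p.suffix.lower() in TIFF_SUFFIXES:
            tiffs.append(p)
        else:
            others.append(p)
    return tiffs, others


def convert_command(src):
    """Build an ImageMagick command that writes a .tif next to src."""
    src = Path(src)
    return ["convert", str(src), str(src.with_suffix(".tif"))]


if __name__ == "__main__":
    tiffs, others = split_queue(["page1.tif", "page2.png"])
    for path in others:
        # Pass this list to subprocess.run(..., check=True) to actually convert.
        print(convert_command(path))
```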


Re: OCR Page Namer

Post by ibr4him » 16 Nov 2010, 00:27

Well, who isn't interested? :)

Too much to ask, but erm... can you somehow make it recognize non-English (Arabic) numbers? Please see: http://upload.wikimedia.org/wikipedia/c ... als-en.svg

It'd be a huge time saver if possible. Many thanks!


Re: OCR Page Namer

Post by Anonymous1 » 16 Nov 2010, 00:59

Hmm, that isn't really up to me. You will have to teach Tesseract to do that, as all I am doing is cropping the image to just that box and letting Tesseract do the work. As far as I can tell, it's REALLY hard to recognize Arabic and other connected scripts, but you are welcome to try to train Tesseract on Arabic numbers.

Here's how you do it (I haven't played with it yet): http://code.google.com/p/tesseract-ocr/ ... Tesseract3

Good luck!
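
To make the "crop to the box, then let Tesseract do the work" idea concrete, here is a rough sketch. It is not the actual script: the function names, the `work_base` file name, and the box coordinates are hypothetical, and it assumes the `convert` (ImageMagick) and `tesseract` commands are installed and on your PATH.

```python
import re
import subprocess


def crop_command(src, dst, x, y, width, height):
    """ImageMagick crop; geometry is WIDTHxHEIGHT+X+Y (x, y assumed >= 0)."""
    geometry = f"{width}x{height}+{x}+{y}"
    # +repage drops the leftover canvas offset after cropping.
    return ["convert", src, "-crop", geometry, "+repage", dst]


def ocr_command(image, output_base):
    """Tesseract writes its text to output_base + '.txt'."""
    return ["tesseract", image, output_base]


def parse_page_number(text):
    """Return the first run of digits in the OCR text, or None."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def read_page_number(src, box, work_base="pagenum_crop"):
    """Crop src to box = (x, y, width, height), OCR it, parse the number."""
    x, y, width, height = box
    cropped = work_base + ".tif"
    subprocess.run(crop_command(src, cropped, x, y, width, height), check=True)
    subprocess.run(ocr_command(cropped, work_base), check=True)
    with open(work_base + ".txt", encoding="utf-8") as handle:
        return parse_page_number(handle.read())
```

Something like `read_page_number("page_0001.tif", (900, 40, 300, 80))` would then give back an integer, or `None` if no digits were recognized. Recognizing other numeral systems would still depend on Tesseract's training data, as noted above.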


Re: OCR Page Namer

Post by Anonymous1 » 20 Dec 2010, 00:56

For anyone that's still interested, the semester is over, and I had some time to update the script. I've abandoned the GUI, as it is way too hard to work with. CLI is good for now.

Currently, this is what it does (it's tailored to my needs, but if you can supply me with sample pages, I can add more functionality to it):
  • Reads all TIFF files from a directory. It doesn't process non-TIFF images (it's a Tesseract thing. I can't do much about it).
  • Currently reads page numbers from the top or bottom of the page. I always export Scan Tailor's pages individually (i.e. no resizing, margins, etc.), so it expects the text to be very close to the top/bottom. I can use GIMP's command-line interface to autocrop (temporarily) in order to process the image.
  • Sorts files into three categories: Finished, Ambiguous, and Unknown.
      • Finished: Completely done pages. I always double check, as it reads numbers from tiny pages too (I'm working on making it look for similar-sized pages).
      • Ambiguous: This took me a while to figure out. When the OCR reads a page and finds that it has a conflicting number (for whatever reason), it throws the current page and the page with the same number as the current page into the Ambiguous bin. It's a safety precaution.
      • Unknown: Self-explanatory. The OCR couldn't find any page numbers.
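
The three bins can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the script itself; `sort_pages` and the shape of its input (a dict of filename to OCR'd page number, with `None` when nothing was found) are assumptions.

```python
from collections import defaultdict


def sort_pages(ocr_results):
    """Sort files into Finished, Ambiguous, and Unknown.

    ocr_results maps filename -> page number (int), or None if the OCR
    found no page number.
    """
    by_number = defaultdict(list)
    unknown = []
    for filename, number in ocr_results.items():
        if number is None:
            unknown.append(filename)
        else:
            by_number[number].append(filename)

    finished = {}   # page number -> the single file that claimed it
    ambiguous = []  # every file whose number is claimed more than once
    for number, files in by_number.items():
        if len(files) == 1:
            finished[number] = files[0]
        else:
            ambiguous.extend(files)
    return finished, sorted(ambiguous), sorted(unknown)
```

Both pages of a conflicting pair land in Ambiguous, so neither is trusted until someone checks them by hand.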

Just for my personal use, I added some statistics to it and a page checker. It tells you what pages you are missing, which is dependent upon how well you answer the script's questions.
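
A page checker of this kind can be quite small. The sketch below assumes the script's questions boil down to the first and last page number of the book; that is a guess, and `missing_pages` is a hypothetical name.

```python
def missing_pages(found_numbers, first_page, last_page):
    """Return the page numbers in [first_page, last_page] that were not found."""
    expected = set(range(first_page, last_page + 1))
    return sorted(expected - set(found_numbers))


if __name__ == "__main__":
    # Pages 3, 5, and 6 were never recognized in this made-up example.
    print(missing_pages([1, 2, 4, 7], first_page=1, last_page=7))
```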

As a future roadmap, I plan to add these features:
  • Left/right page detection. This shouldn't be too hard, but I haven't encountered any books like this yet.
  • Blob-based selection. Currently, I rely upon a user's coordinates to detect pages, which works. I was looking into OpenCV, and I think it would be quite good for this project. This is unlikely to happen though.
  • Other stuff?
  • And finally, a GUI. Notice that it goes after the other stuff, as it's a pain.
And as for the dependencies, it only requires Tesseract OCR, Python, and ImageMagick. I hope this comes in handy (don't worry, I'll keep working on it, as the books keep on scanning)!
