Getting OCR + Uniform Lettering without the corrections?

Marcus Eastty
Posts: 9
Joined: 13 Apr 2015, 01:45
Number of books owned: 0
Country: United States

Getting OCR + Uniform Lettering without the corrections?

Post by Marcus Eastty » 14 Apr 2015, 16:08

Hey everybody,

First of all, I owe the site and the contributors to free software a huge thank you (and definitely a donation if I can figure out how to work out some kinks).

I finished my first scan about a week ago using a basic camera/tripod-aimed-at-an-angled-wedge-with-piece-of-glass-over-each-page-type set-up, one page at a time.
I put it through Scan Tailor, and the results were "good enough" considering it was my first time using it.
I loaded it into Adobe Acrobat Pro DC and ran OCR. It worked: all the words are recognized and the text is searchable. The document is passable for reading, but the file size is much larger than I was expecting. Running Pro DC's version of "ClearScan" did not help. (I have seen the "editable text and images" option in OCR reported on the Internet as ClearScan's equivalent. I am new to this, so I am unfamiliar with what ClearScan was, but this "editable text and images" option did NOT decrease file size.) I noticed that the text still appeared to be that of the original scan, not altered so that each character was uniformly depicted throughout the document. I thought this could be why the file size was so large.

I found if I go to Edit -> Settings -> and check the "Use available system font" option, it will make each letter uniform. Not only that, when I saved a new copy after using "Use available system font," the file size was about half of the original. Super! However, character uniformity seems to be based on what the scanned image looks like and will essentially ignore what OCR found. I know OCR also is based on what the scanned images "look like," but the two must operate independently because their outputs were different.

So for example, the word "grounding" is near a page edge; it is shadowed a little more than is ideal and is a little difficult to read. I run OCR, and it is able to recognize it as the word "grounding." Super! Now I run the "Use available system font" option, and the word turns into "groundinl." A search no longer finds the word "grounding" anywhere in the document. This "Use available system font" option seems to have undone OCR, and it is inaccurate to the degree that it is not feasible to manually fix the mistakes. (There were far more mistakes than just the word "grounding": # in place of H, @'s thrown in, etc.)
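To get a sense of how many words differ between the two versions, one option is to export the text from each PDF and compare them word by word. The following is only a minimal sketch: it assumes both versions have already been saved as plain text by some means, and the function name `changed_words` is hypothetical, not part of Acrobat or any OCR package.

```python
import difflib


def changed_words(ocr_text, uniform_text):
    """Return (before, after) pairs for word runs that differ between two texts."""
    before = ocr_text.split()
    after = uniform_text.split()
    # autojunk=False so that frequent words are not ignored on long documents.
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(before[i1:i2]), " ".join(after[j1:j2])))
    return changes


# Hypothetical sample sentences, not taken from the scanned book.
print(changed_words("check the grounding wire", "check the groundinl wire"))
```

A long list of pairs would suggest the second pass re-recognized the text rather than reusing the first OCR result.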

So what I would like is to have uniform lettering throughout the document, have it be accurate, and have it be searchable. I plan to use a much greater PPI next time in hopes of improving accuracy, but is there anything else I can do? Is there any software available that will do what I am looking for? I do not plan to scan more than maybe 5 or 10 books, so something that is relatively inexpensive would be ideal, but all suggestions are certainly welcome.
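For a camera-based set-up, the effective PPI can be estimated from the number of pixels that span the page and the physical page width. This is a rough sketch with a hypothetical function name and made-up example numbers, assuming the measurement is taken across the page itself rather than the whole frame.

```python
def effective_ppi(pixels_across_page, page_width_inches):
    """Estimate pixels per inch from pixels spanning the page and its width."""
    if page_width_inches <= 0:
        raise ValueError("page width must be positive")
    return pixels_across_page / page_width_inches


# Hypothetical example: the page occupies 3000 pixels and is 6 inches wide.
print(effective_ppi(3000, 6))
```

Moving the camera closer or using a higher-resolution sensor raises the pixel count across the page, and with it the effective PPI.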

Thank you all very much,
Marcus

Marcus Eastty
Posts: 9
Joined: 13 Apr 2015, 01:45
Number of books owned: 0
Country: United States

Re: Getting OCR + Uniform Lettering without the corrections

Post by Marcus Eastty » 14 Apr 2015, 16:17

Here are a couple examples.

The first attachment is of a paragraph after searching for "grounding" after first running the "ClearScan equivalent" OCR.
[Attachment: after OCR.jpg (54.51 KiB)]
(Edit: I realize the search term is in fact "ground" in the image. But I've gone back and checked, searching for the term "grounding" yields the same result.)

The second attachment is of the same paragraph after searching for "groundin" (searching for "grounding" now yields no results because it has been converted to "groundinl" after running "Use available system font").
[Attachment: after Uniformity.jpg (83.61 KiB)]
The quality of the first image is not worse than the quality of the scan as a whole, but even the best parts of the scan are not as easy to read as the second image. Is there a way to get the quality of the second image with the search/character accuracy of the first?

Thanks!

BruceG
Posts: 67
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: Getting OCR + Uniform Lettering without the corrections

Post by BruceG » 22 Apr 2015, 06:12

Marcus
It is for this reason that I use a standalone OCR program. I started doing this because Adobe 9 had no usable text editor.
I have had a look at Adobe XI and it has a text editor built in. It is not as good as InFix, which I use as a PDF editor, but it is good enough to fix the problems you mention.
It works on a normal (non-scanned) PDF file; I have not tried it with ClearScan, though.
My v9 does 'reduced file size', so I expect newer versions would also. My understanding is that with Adobe's OCR an extra layer is made for text that sits over the image, whereas with standalone OCR programs the image is replaced by text and photos, with a saving of space as a result.
