My DIY Scanner

dpc · Post by **dpc** » 23 Dec 2016, 18:05

I gave up on OCR and have resorted to using Adobe Acrobat's Clear Scan mode for PDFs. The amount of hand-holding required to produce error-free OCR'ed text was too burdensome, and besides, I was digitizing books from my personal library and there are cases where I want the notes in the margins and the various highlights and underlines that I've made over the years to remain in the digitized versions.

More pixels per text character is the result of increasing the DPI so I'm afraid I don't understand when you say that you're stuck at 150 DPI no matter the zoom. That doesn't make sense to me unless that is some quirk in the software you're using.

cday · Post by **cday** » 23 Dec 2016, 18:15

dpc wrote:More pixels per text character is the result of increasing the DPI so I'm afraid I don't understand when you say that you're stuck at 150 DPI no matter the zoom. That doesn't make sense to me unless that is some quirk in the software you're using.

The page images have a fixed number of pixels, I think the problem is that the page dimensions indicated are incorrect, too large, so that the DPI value shown is too low: the page dimensions can easily be corrected using image editing software so that the correct DPI is shown, but the pixels per character, which is what determines the displayed quality, will be unchanged...
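To make the point concrete, here is a minimal sketch of the arithmetic. All of the numbers are assumptions chosen for illustration (they are not measured from the images discussed in this thread), and `dpi` is just a hypothetical helper name. It shows that correcting the stated page width changes the reported DPI while the pixel counts stay exactly the same:

```python
def dpi(pixels: int, inches: float) -> float:
    """DPI implied by a pixel count and a stated physical size."""
    return pixels / inches

# All values below are assumed for illustration only.
width_px = 4200           # pixel width of the page photo
claimed_width_in = 28.0   # incorrect page width stored with the image
true_width_in = 6.0       # actual width of the printed page
char_height_px = 30       # pixel height of a typical character

reported_dpi = dpi(width_px, claimed_width_in)   # what the software shows
corrected_dpi = dpi(width_px, true_width_in)     # after fixing the page size

print(f"reported DPI:  {reported_dpi:.0f}")
print(f"corrected DPI: {corrected_dpi:.0f}")
# The character is still the same number of pixels tall either way.
print(f"pixels per character (unchanged): {char_height_px}")
```

Only the metadata changes here; nothing is resampled, so the detail available to OCR software is identical before and after the correction.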

BillGill · Post by **BillGill** » 23 Dec 2016, 20:52

More pixels per character is based on the magnification of the page by the zoom. If you zoom 2 times then the number of pixels in the picture will remain the same, but the space in the picture occupied by a character will be twice as large. So you wind up with more pixels per character. Possibly you have been misled by a common problem that is not necessarily clearly addressed in camera specs: the difference between optical and digital zoom. With optical zoom the actual magnification from the lens is increased. That is, a scene that might be 20 feet wide will completely fill the sensor in the camera at one zoom level, while at a larger zoom level only a part of the scene will fill the sensor. With digital zoom you are actually just cropping the picture as it appears on the sensor. An explanation can be found at http://www.dummies.com/photography/digi ... al-camera/. When using a camera for digitizing books you want to use only optical zoom.

You are right about the dimensions being wrong. I checked one scan and according to the computer the picture is 28 inches wide. Obviously I am not trying to scan a book with 28 inch pages.

I am using an OCR because I am generating EPUBs. So I need the actual words. And you are right about the amount of work involved. I generally do postprocessing on one chapter at a time. Getting one chapter from picture to text ready to actually be edited takes me about a half an hour. After that I get to start working. Correcting all the errors in the text, using a word processor, is the long end of the task. I go through each chapter at least 3 times making corrections. Then I put the chapters together to make the full length text of the book and go through again, still making corrections, then I convert it to an EPUB in Calibre and go through making more corrections. I expect I could go through it 100 times and still be finding errors.

cday · Post by **cday** » 24 Dec 2016, 03:33

BillGill wrote:More pixels per character is based on the magnification of the page by the zoom. If you zoom 2 times then the number of pixels in the picture size will remain the same, but the space in the picture occupied by a character will be twice as large. So you wind up with more pixels per character. Possibly you have been misled by a common problem that is not necessarily clearly addressed in camera specs. That is the difference between optical and digital zoom...

I'm not a camera user, my comment related only to your statement that the DPI values you were seeing were suspect and that the page sizes were also too large: for a given image reducing the 'canvas' or 'print' size using image editing software will result in a corresponding increase in the DPI value shown, but not increase the number of pixels per character which determines the text quality.

BillGill wrote:I am using an OCR because I am generating EPUBs. So I need the actual words. And you are right about the amount of work involved. I generally do postprocessing on one chapter at a time. Getting one chapter from picture to text ready to actually be edited takes me about a half an hour. After that I get to start working. Correcting all the errors in the text, using a word processor, is the long end of the task. I go through each chapter at least 3 times making corrections. Then I put the chapters together to make the full length text of the book and go through again, still making corrections, then I convert it to an EPUB in Calibre and go through making more corrections. I expect I could go through it 100 times and still be finding errors.

Which OCR software are you using, out of interest, because for reasonable quality images with simple formatting and a DPI in the range 200 to 300, results using current commercial software shouldn't be too bad?

Edit:

If you care to upload an example page image, and the actual page dimensions, I'll try to set the indicated size to match the true size, so that the actual DPI of the text can be determined.

And a typical page image (after any processing you perform) would also allow anyone to see what output they can obtain using other OCR software.

BillGill · Post by **BillGill** » 24 Dec 2016, 10:35

I am using OmniPage 18 for OCR. It does a pretty good job. Looking around, it seems to be generally felt that OmniPage is about the best available. The problem with the resolution is kind of annoying, but I have played around in Photoshop to see about correcting it so that I get the more reasonable 300 DPI resolution and it didn't help. The biggest problem that the 180 DPI resolution gives me is that I wind up with a carriage return at the end of every line of text. So I have to process it to get rid of all the extra carriage returns. I have developed a procedure that works pretty well. I go through and insert a flag at the end of each actual paragraph. Then I do a search and replace to replace all the carriage returns with spaces. Then once more to replace all the flags with carriage returns. Then one more time to replace every place that has a carriage return followed by a space with just a carriage return. And of course I use a batch process to do all of that automatically. For one chapter this takes about 20 minutes. I include the time for this in my estimate of postprocessing time, which is about 30 minutes for 1 chapter.
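For anyone who would rather script the search-and-replace steps than run them in a word processor, here is a rough Python sketch of the same idea. It is not the batch process described above, only an illustration of it: the flag string `@@P@@` and the function name `unwrap_lines` are made up for this example, and it assumes the flags have already been inserted by hand at the end of each real paragraph.

```python
import re

# Hypothetical marker; any string that never occurs in the book will do.
PARA_FLAG = "@@P@@"

def unwrap_lines(text: str, flag: str = PARA_FLAG) -> str:
    """Remove per-line breaks, keeping a break only where a flag was placed."""
    # Treat Windows (CR LF) and bare CR line endings the same as LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Replace every line break with a space.
    text = text.replace("\n", " ")
    # Turn each paragraph flag into a real line break.
    text = text.replace(flag, "\n")
    # A line break followed by spaces becomes just a line break.
    text = re.sub(r"\n +", "\n", text)
    return text.strip()

sample = (
    "It was a dark and\n"
    "stormy night.@@P@@\n"
    "The rain fell in\n"
    "torrents.@@P@@\n"
)
cleaned = unwrap_lines(sample)
print(cleaned)
```

This sketch does not rejoin words that were hyphenated across a line break; those would still need a separate pass or manual correction.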

As far as the editing after that goes, that is just naturally going to be the way it is. The amount required depends on the quality of the scans. And that depends to a large extent on the quality of the source. I am scanning from a variety of materials, but a lot of them are old paperback books. So they can have pretty poor text for scanning. They have faded ink, yellow paper, and speckles, all of which reduce the quality of the OCR output. Heck, even with a perfect scan the OCR will have errors. I have seen someplace somebody claiming 99% accuracy with their OCR. Doing a little arithmetic that would suggest that, based on 5 characters per word, every 20 words would have an error. In a book with 800,000 words that makes quite a few. But as I said you don't get perfect scans, so there are a lot of errors in a book.
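The arithmetic can be written out as a quick sketch. It assumes, as above, that the 99% figure is per character and that a word averages 5 characters; the function names are only for illustration.

```python
def words_per_error(chars_per_word: float, char_accuracy: float) -> float:
    """Average number of words between errors at a given per-character accuracy."""
    return 1.0 / (chars_per_word * (1.0 - char_accuracy))

def expected_errors(words: int, chars_per_word: float, char_accuracy: float) -> float:
    """Expected number of wrong characters in a text of the given length."""
    return words * chars_per_word * (1.0 - char_accuracy)

spacing = words_per_error(5, 0.99)
total = expected_errors(800_000, 5, 0.99)

print(f"about one error every {spacing:.0f} words")
print(f"about {total:,.0f} errors in 800,000 words")
```

If the claimed accuracy were per word instead of per character, the same functions with `chars_per_word=1` would give one error every 100 words.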

cday · Post by **cday** » 24 Dec 2016, 11:26

BillGill wrote:The problem with the resolution is kind of annoying, but I have played around in Photoshop to see about correcting it so that I get the more reasonable 300 DPI resolution and it didn't help.

I would assume that OCR recognition accuracy depends on the number of pixels per character, which won't be changed by correcting the DPI value of the image, rather than the actual DPI, so I wouldn't expect that correcting the DPI value would increase recognition accuracy...

BillGill wrote:The biggest problem that the 180 DPI resolution gives me is that I wind up with a carriage return at the end of every line of text. So I have to process it to get rid of all the extra carriage returns.

I don't immediately understand why that happens...

BillGill wrote:Heck, even with a perfect scan the OCR will have errors. I have seen someplace somebody claiming 99% accuracy with their OCR. Doing a little arithmetic that would suggest that, based on 5 characters per word, every 20 words would have an error.

I would assume that would be accuracy per word rather than per character, although that might be incorrect, and also for a high-quality image.

Do you have an example page you could upload, in case someone can make a useful suggestion?

BillGill · Post by **BillGill** » 24 Dec 2016, 14:11

I don't understand why it puts the extra carriage returns in either. It doesn't happen on a document that is scanned in through a 300 DPI scanner, rather than a photo. I contacted Nuance about it and they didn't get me any help, so I live with it.

I have no idea how they defined their accuracy, so I just take the simplest definition. Anyway, other than the extra carriage returns, the basic difference is in the source material. Better material gives me a better scan and a better document, with a lot less editing. But it still takes a lot of editing. Even if I am working from original files I keep finding things that need to be corrected. I do some work from original files. I am working with an author to create ebooks from some of her work. They still have plenty of errors. So I don't worry too much about the OCR, it is pretty much the short end of the stick.

I don't use all the features of the OCR anyway. When I save the text to be inserted into a word processor (I am using Open Office Writer) I don't try to save it in a formatted form, I just save it as a plain text file (*.txt). If I try saving all the formatting I wind up with weird page sizes and fonts that have to be corrected before I can really get to doing the editing. OCR software tries to do an accurate job of formatting, but it has to make judgement calls based on what it sees in the image. That can be misleading.

The net result is that I'm not really concerned about the OCR operation. It is doing an adequate job. Obviously it could do better, but it takes a lot of research and development to improve on something like this. Trying to write software that can duplicate the human mind is a hugely challenging task. I'm just happy that it does as good a job as it does.

The big thing is that I would like to replace the Pi-Scan with software that would run on a PC and give me greater control. With Pi-Scan I get a pretty good product, but it is hard to tell if something goes wrong in the middle of a scan session. Then I don't find out about it until I transfer the pictures to my PC and start processing them. If there is an error I have to go back and redo the scan, or at least the part of it that is bad.

cday · Post by **cday** » 24 Dec 2016, 15:15

So your biggest problem is the extra carriage returns, but it's not really a problem?

In the absence of one of your images, I've had a look at OCR'ing one of my scans in OmniPage 18, and then saving the resulting output as a text file; in the save menu I notice that there are actually three options for saving as *.txt :

Text (*.txt)
Text - Formatted (*.txt)
Text with line breaks (*.txt)

Looking at the resulting output text from each, the first option Text (*.txt) gives output with line breaks only at the end of paragraphs, as you would prefer, the other two options give output with a line break at the end of every line, as you say you are obtaining with some input images.

My test image is 400DPI scanned on a flatbed scanner (and black and white) but I don't immediately see that the DPI is likely to be a factor in the results...

BillGill · Post by **BillGill** » 24 Dec 2016, 19:38

It may not be obvious that the lower resolution (180 DPI) would make a difference, but it does. I have only used input from a flatbed scanner at 300 DPI one time, but that time it worked just fine; when I switched to photos at 180 DPI it quit working correctly. So I am just living with it. As I say it is a minor annoyance, but I can live with it. It is a pretty small part of the overall job of creating an EPUB from a book. It takes about a half an hour for me to go from pictures to a word processor document, with the carriage returns fixed. That is for one chapter. After that, as I said, it is the editing that takes time. Overall it takes me around 2 weeks to turn a print book into an EPUB. I am retired so I have plenty of time, although I don't work full time on the book in progress. I take plenty of breaks to read and watch TV and go shopping and what have you.

BruceG · Post by **BruceG** » 25 Dec 2016, 17:56

I am also doing some work making EPUB files for my ereader using OmniPage (19 or Ultimate) for OCR. These are public domain books that someone else is scanning and making available as searchable PDFs.
My process is:
1. select only one font (I have chosen Times New Roman), i.e. force OmniPage to use only that font
2. load files - sometimes some pages need resizing to do this, usually maps or photos that for some reason were scanned differently
3. zone 1 page at a time - leaving headers/footers not zoned
4. OCR that page and edit as/if required
5. continue page by page
6. save as EPUB file
7. check with ereader while editing in the Calibre e-book editor. The most common problem is with lines going across pages: for some reason blank lines/spaces are inserted during the saving or reading of the EPUB file. This does not happen as often when the sentence finishes at the bottom of the page.

I would like to start chapters on a new page, but have not learnt how to do that.

I would be happy to look at and be challenged by a few chapters if you could upload or provide a link to the original photos
