Daniel Reetz, the founder of the DIY Book Scanner community, has recently started making videos of prototyping and shop tips. If you are tinkering with a book scanner (or any other project) in your home shop, these tips will come in handy. https://www.youtube.com/channel/UCn0gq8 ... g_8K1nfInQ

My DIY Scanner

Built a scanner? Started to build a scanner? Record your progress here. Doesn't need to be a whole scanner - triggers and other parts are fine. Commercial scanners are fine too.
BillGill
Posts: 117
Joined: 18 Dec 2016, 17:13
E-book readers owned: Calibre, FBReader
Number of books owned: 7000
Country: USA

Re: My DIY Scanner

Post by BillGill » 25 Dec 2016, 19:22

Thanks for the process you give. I will try that and see if I get any better results. When I do my editing in a word processor I divide the chapters with some kind of break. In Word I used division breaks; now in Open Office Writer I use page breaks between chapters. At first, when I converted the Open Office files to EPUBs using Calibre, they didn't get chapter breaks. I'm not sure now exactly what I did in Calibre, but I made a change in the settings and it now puts each chapter in a separate text file. I also make sure that the heading for each chapter is styled as a heading. That may help, but I couldn't say for sure. Calibre is rather complicated; I just studied enough to get through my digitizing, and there is a lot more to it than that.

If you don't get separate files for each chapter you can split the files generated by Calibre. If you haven't found it, the split control is a button at the bottom of the preview page. It looks like 2 pages, one above the other. The help message when you hover over it tells you what to do. That may cause the chapters to start on a new page.

Here are a couple of pages out of the book I am currently working on.
0036.jpg
0037.jpg
If you want to try them you are welcome.

BruceG
Posts: 71
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: My DIY Scanner

Post by BruceG » 26 Dec 2016, 03:40

This is a challenge. What is the page size, or what area is covered by the photos? At 180 pixels per inch one page is 29.72 by 36.32 cm and the other 31.07 by 54.69 cm. Is there a reason for the different dimensions? I am sure the font size is not 28 points, which is what this page size would require.

Re: My DIY Scanner

Post by BillGill » 26 Dec 2016, 10:32

That's where a lot of the problem comes in. The actual page size is approximately 4" x 7" (10 x 17.5 cm). The pictures I sent are the cropped versions, with just the text, so they are a bit smaller than the actual page. The computer gives an extremely exaggerated page size, which confuses everything. I have no idea why the page size shows up so large in the computer. Well, I can see that if the camera has a standard picture size that ignores the zoom it might confuse the dimensions, but I haven't found a way to fix them. So, as I say, I just live with it and export everything as text files.

The difference in the dimensions of the 2 pages is that I cropped them differently. Basically I just wanted to get the text, without any more background than I needed so that the OCR wouldn't pick up blemishes and try to turn them into text.

Just did a quick check. The image size measurement anomaly isn't unique to the Canon cameras I am using. I did a test shot with a different camera before I built my scanner, and it also shows extremely large sizes for the pages.

Re: My DIY Scanner

Post by BillGill » 26 Dec 2016, 13:51

OK, I did some research on the internet and then did some experiments. The physical dimensions, in inches or cm, that we are seeing are the dimensions at which the picture would be displayed at the stated DPI setting. I opened a picture in Photoshop and changed the image size (image/resize/image size). When I changed the image size from bignum inches by bignum inches to approximately the size of the page I had photographed, the DPI changed to around 300. I then loaded that modified picture into the OCR software and the result was much closer. The page had paragraph marks only at the ends of the paragraphs, and the font size was closer. It was still not quite right, at something like 17 points, but that is so much closer. I will have to do some more experiments to figure out just how I need to set up the pictures, but this is a big step forward.
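
The relationship described here is plain arithmetic: the reported size is the pixel count divided by the DPI tag, and the true DPI is the pixel count divided by the real size of the area photographed. Below is a minimal Python sketch of that arithmetic. The pixel dimensions are hypothetical (the thread does not state the real ones), and the function names are made up for illustration.

```python
CM_PER_INCH = 2.54

def display_size_inches(width_px, height_px, dpi):
    """Size at which an image would be displayed or printed at a given DPI tag."""
    return width_px / dpi, height_px / dpi

def dpi_for_physical_size(pixels, inches):
    """DPI implied when `pixels` are known to span `inches` of real paper."""
    return pixels / inches

# Hypothetical image: 1200 x 2100 pixels covering a 4 x 7 inch page.
width_px, height_px = 1200, 2100

# With a 180 DPI tag (the figure quoted earlier in the thread) the
# reported size is much larger than the real page.
w_in, h_in = display_size_inches(width_px, height_px, 180)
print(f"at 180 DPI: {w_in * CM_PER_INCH:.1f} x {h_in * CM_PER_INCH:.1f} cm")

# Entering the real page width instead gives the real DPI.
print(f"true DPI: {dpi_for_physical_size(width_px, 4.0):.0f}")
```

Changing the size in an image editor without resampling only rewrites the DPI tag; the pixels themselves stay the same.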

Also, of course, Photoshop has a batch processing option under the File menu. With that it should be possible to resize all of the pictures from a particular book to the correct size in one pass. This should take care of some of my OCR problems. There will still be plenty of editing to correct all the typos the OCR inserts.

cday
Posts: 246
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: My DIY Scanner

Post by cday » 26 Dec 2016, 14:38

I can confirm from some tests that I've been doing since you uploaded your example pages that increasing the image DPI value can substantially improve the OCR output in terms of eliminating unwanted paragraph breaks; from my tests 400 DPI or even slightly higher is a better match to the page size for the approximate dimensions you've given.

I've also possibly seen some benefit from deskewing the image, but there are so many enhancements that could be tried, both to avoid unwanted paragraph marks and to improve recognition of difficult characters, that if you are already getting acceptable output there is a limit to the time it is worth spending.

With regard to text point size, does the value shown in OmniPage matter when point size can very easily be changed for the whole document using a word processor?

And if unwanted paragraph breaks can be largely eliminated, would proofing the output directly in the OmniPage editing window with the paragraph marks displayed save time overall? You could easily edit out any unwanted breaks and also correct any text errors immediately, and then save the file as an edited text file.

Re: My DIY Scanner

Post by BillGill » 26 Dec 2016, 19:50

In the future I plan to reset the page size to match the actual dimensions of the image. I plan to get some stick-on rulers and place them on the cradle that holds the books. Then when I get the images into my computer I will reset the page size to the size measured on the picture. That way I should get the optimum sizing and the font size should come out pretty good.
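
As a rough illustration of the ruler idea: if two ruler marks can be located in the image, the DPI follows from the pixel distance between them, and the page size follows from the DPI. This is only a sketch under stated assumptions. The pixel coordinates are hypothetical, the function names are invented for the example, and it assumes the ruler lies in the same plane as the page with no keystone distortion.

```python
import math

def dpi_from_ruler(p1, p2, inches_between):
    """Estimate DPI from two ruler marks located in the image.

    p1, p2: (x, y) pixel coordinates of the two marks.
    inches_between: real distance between those marks on the ruler.
    The straight-line distance is used, so a slightly rotated ruler
    still gives a sensible estimate.
    """
    pixel_distance = math.hypot(p2[0] - p1[0], p2[1] - p1[1])
    return pixel_distance / inches_between

def page_size_inches(width_px, height_px, dpi):
    """Physical size of the image area once the DPI is known."""
    return width_px / dpi, height_px / dpi

# Hypothetical measurement: the 0" and 6" marks found at these pixel positions.
dpi = dpi_from_ruler((210, 95), (2010, 101), 6.0)
width_in, height_in = page_size_inches(1200, 2100, dpi)
print(f"DPI ~ {dpi:.0f}, page ~ {width_in:.2f} x {height_in:.2f} in")
```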

OmniPage does a pretty good job without doing any deskewing. But I am still working on getting the camera mounts adjusted so that there will be very little skewing. The samples I gave still had some, but since then I have gone back and shimmed the cameras to reduce what is left. Actually, the skewing affects my cropping of the picture more than it does the OCR. If the gutter margin is small when it comes time to crop, then skew can cause the crop to slip off the desired part of the picture. I think it is easier to get the cameras adjusted once and have them properly aligned than it is to deskew the pictures. That was one of the reasons I wanted to make a better camera mount.

As far as text size is concerned, it is probably better to have properly sized text in case there is a place where the text size in the document changes. Right now, going through the straight text file, I lose all that information, along with special character formatting such as italics. Getting a properly detected font size will be a help. I will probably have to go ahead and do some resizing in any case. I really don't expect the OCR to get it right every time.

For editing I think I prefer to transfer the document to a real word processor. A real word processor has a lot of facilities for formatting text that won't be available in the OmniPage editor. Sending it to a word processor isn't a problem. In the 'Save to File' menu in OmniPage there is a large selection of file formats. You should be able to find a format that can be used by just about any word processor. Mine doesn't have a *.odt output for Open Office, but it does have a *.rtf which can be edited with Open Office.

Some free OCR software doesn't have the flexibility to do all of this. In that case you might find it best to go through the *.txt file format.

Re: My DIY Scanner

Post by BruceG » 26 Dec 2016, 20:03

I wrote this before seeing the post above.
This link is the epub file of the two pages. If all pages were processed at the same size and the dimensions were set the same, things would be a lot easier.

https://www.dropbox.com/s/ijsss7sx49ke9 ... .epub?dl=0

The first page had mostly non-text errors (lines, dots, etc.); I think there was only one on the second page.
Most errors can be fixed in OmniPage as mentioned, including font sizes. It is just better if the page size is set first, because then no action is required.

Deskewing will prevent different size margins, if that is a problem. YASW is not a bad program for rotation, dekeystoning, cropping and scaling. David Landin has produced a good video on its use.

Re: My DIY Scanner

Post by BillGill » 27 Dec 2016, 00:01

I just did another test. First I used Photoshop to adjust the dimensions in inches to match the actual dimensions of the area the picture covers. This corrected the DPI to a more realistic figure. Then I cropped the page. For this test I chose a page that had only about half a page of text, with a large blank area. I checked the DPI after the resize and then after the crop. The DPI matched in both cases. That being the case, I think that the amount of cropping won't affect the font size. As long as the DPI is the same, the OCR should detect the same font size, since the dots per character will be the same. I haven't taken the experiment that far, so I can't positively say it will be that way, but I expect that is pretty much how it will work.
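
The reasoning can be checked with a little arithmetic. A point is 1/72 inch, so a character's size in points is its height in pixels divided by the DPI, times 72. Cropping discards whole pixels without rescaling, so the pixels per character and the DPI both stay the same. The sketch below is only an illustration with assumed numbers. It is not how any particular OCR engine measures font size, but it does happen to line up with the 28 point figure at 180 DPI and the roughly 17 point figure at around 300 DPI reported earlier in this thread.

```python
POINTS_PER_INCH = 72

def point_size(glyph_height_px, dpi):
    """Font size in points for a glyph of the given pixel height."""
    return glyph_height_px / dpi * POINTS_PER_INCH

def dpi_of(pixels, inches):
    """DPI when `pixels` span `inches` of real paper."""
    return pixels / inches

# Glyph height (in pixels) implied by a 28 pt reading at 180 DPI.
glyph_px = 28 * 180 / POINTS_PER_INCH

for dpi in (180, 300, 400):
    print(f"{dpi} DPI -> {point_size(glyph_px, dpi):.1f} pt")

# Hypothetical crop: 1200 px spanning 4 in, cropped to 900 px spanning 3 in.
# Pixels and inches shrink together, so the DPI is unchanged.
print(dpi_of(1200, 4.0), dpi_of(900, 3.0))
```

On these numbers, a tag of about 400 DPI would bring the same text down to roughly 12.6 points.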

You can do your editing in OmniPage. It should work just fine. However, as I said, I prefer to use a word processor because of its greater capabilities. This may of course be partly because I am used to word processors. I have written many pages using Word. I'm not using Word now, but Open Office provides just about as much power, and is free.

As far as de-skewing and de-keystoning are concerned, now that I have gotten a nice stable camera mount on my digitizer I don't expect to have any problems that way. I have gotten my procedures for cropping the scans in Photoshop working so that I can get that part of the job done fairly quickly, so I think I will go ahead and stick to what I know works there.

dpc
Posts: 315
Joined: 01 Apr 2011, 18:05
Number of books owned: 0
Location: Issaquah, WA

Re: My DIY Scanner

Post by dpc » 27 Dec 2016, 13:39

When I scan a book I first take a few shots of calibration pages so that I can use them as input to front end stages of my software processing pipeline. One of these calibration pages is simply a 2"x2" black square positioned in the center of a white sheet of paper. During post-processing of the images one of my image processors looks at this calibration image and notes the width and height of the 2"x2" square in pixels. It then positively knows how many pixels per inch (aka DPI) the scanned images will be and passes that value down the pipeline to subsequent processing steps. All of this is automated (other than me having to shoot the calibration pages) and eliminates the sort of problems that you're having.
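
For readers who want to try the idea, here is a minimal Python sketch of the measuring step. It is not the software described in this post. It assumes the calibration photo has already been loaded as a 2D list of grayscale values, that the square is the only dark region in the frame, and that it is not noticeably rotated. The function name and threshold are hypothetical.

```python
def square_dpi(gray, square_inches=2.0, threshold=128):
    """Estimate (horizontal, vertical) DPI from a calibration image.

    gray: 2D list of grayscale values (0 = black, 255 = white) holding
    one dark square on a light background and nothing else dark.
    """
    dark = [(x, y)
            for y, row in enumerate(gray)
            for x, value in enumerate(row)
            if value < threshold]
    if not dark:
        raise ValueError("no dark pixels found; check the threshold")
    xs = [x for x, _ in dark]
    ys = [y for _, y in dark]
    # Bounding box of the dark pixels = the square's size in pixels.
    width_px = max(xs) - min(xs) + 1
    height_px = max(ys) - min(ys) + 1
    return width_px / square_inches, height_px / square_inches

# Tiny synthetic image: 10 x 10 white pixels with a 4 x 4 black square.
image = [[255] * 10 for _ in range(10)]
for y in range(3, 7):
    for x in range(3, 7):
        image[y][x] = 0

print(square_dpi(image))
```

On a real photo the square would be hundreds of pixels across, and some noise filtering would likely be needed before taking the bounding box.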

Re: My DIY Scanner

Post by BillGill » 27 Dec 2016, 15:07

That sounds interesting. What program are you using? If it is available I might want to investigate it.
