Freeware Windows workflow: 40Mb 400pg OCR'd A4 book pdfs

Share your software workflow. Write up your tips and tricks on how to scan, digitize, OCR, and bind ebooks.

Moderator: peterZ

victoriaaustralia
Posts: 55
Joined: 07 Nov 2011, 16:22
E-book readers owned: newton
Number of books owned: 2
Country: Australia
Location: Castlemaine, Victoria, Australia

Freeware Windows workflow: 40Mb 400pg OCR'd A4 book pdfs

Post by victoriaaustralia »

I am using two Canon PowerShot 10 MP cameras, triggered manually, with an older Bookrppr/Book Liberator design.
The camera images go into two folders, named left and right. I find it useful to record every page, including blank pages; I stick a post-it note onto each blank page for the camera auto-focus to lock onto. Having images of blank pages helps with maintaining the book's page numbering.
Ant-Renamer is used to rename the images in four-digit format, as this is what Homer expects (for example 0002, 0004, 0006 for the left-folder images, and 0001, 0003 for the right pages).
You should then have correctly numbered right and left pages.
These two folders' contents are then placed in a folder named Combined.
This folder is opened with ScanTailor, with the image resolution set to 300 dpi for all pages.
ScanTailor processing is done as per the Vimeo tutorial.
The .TIFF files in the out folder are then used in Homer. Make sure only the .TIFF files are in the folder; remove the Homer cache files or Homer will be confused.
Overview of Homer here: http://www.diybookscanner.org/forum/vie ... 588#p13588
Homer for me does not OCR, but it does easily create lovely small, good-quality PDF files without having to struggle with pdfbeads, Ruby or gem environments.
This PDF file is then OCR'd using PDF-XChange Viewer, which creates the text as a separate layer. For my 400-page book this added only 2 MB.
If any rearranging of the PDF is needed, the free version of PDF-XChange Viewer does not do this; I use PDF SAM (PDF Split And Merge) if pages need shuffling or deleting.
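For anyone who would rather script the interleaving step than use a GUI renamer, here is a minimal Python sketch of the same scheme (illustrative only, and an assumption on my part that the left and right folders each hold their images in shooting order, with right pages taking the odd numbers as described above):

```python
import shutil
from pathlib import Path

def interleave(left_dir, right_dir, out_dir):
    """Copy right-page images to odd numbers (0001, 0003, ...)
    and left-page images to even numbers (0002, 0004, ...),
    producing the four-digit names Homer expects."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    rights = sorted(Path(right_dir).iterdir())
    lefts = sorted(Path(left_dir).iterdir())
    for i, src in enumerate(rights):
        shutil.copy(src, out / f"{2*i + 1:04d}{src.suffix}")
    for i, src in enumerate(lefts):
        shutil.copy(src, out / f"{2*i + 2:04d}{src.suffix}")
```

The copies land in the Combined folder ready for ScanTailor, and the originals stay untouched in case the page order needs re-checking.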
Freeware Windows workflow in 2020
viewtopic.php?f=19&t=3620
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Freeware Windows workflow: 40Mb 400pg OCR'd A4 book pdfs

Post by spamsickle »

Thanks. I meant to try Homer when I saw your praise for it previously, but still haven't gotten around to it. I have several 200+ MB PDF files I created with Scan Tailor, ImageMagick, and pdftk (though I'm not sure how many of those I still have the ScanTailor TIFFs for), and while the size doesn't really bother me, I'm sure my other applications would be happier if the Acrobat Reader was only dealing with 20-40 MB files.

Did you ever get your OCR issue resolved? I think you said Homer was doing the OCR, but the results didn't make it into a PDF layer.

Edit: Never mind, I see your post says the OCR is still not working properly for you with Homer. I don't tend to do OCR anyway, so the compression would be enough for me.
victoriaaustralia
Posts: 55
Joined: 07 Nov 2011, 16:22
E-book readers owned: newton
Number of books owned: 2
Country: Australia
Location: Castlemaine, Victoria, Australia

Re: Freeware Windows workflow: 40Mb 400pg OCR'd A4 book pdfs

Post by victoriaaustralia »

That is correct about Homer and OCR for me. Something I am doing means Homer's OCR has not worked for me (tried on a Windows Vista machine and on Mac OS 10.6). The pdfbeads compression side of the package works beautifully every time, but there is no OCR component in the output PDF.

However, as above, I have found PDF-XChange Viewer does an excellent freeware OCR job, so I take the Homer/pdfbeads output PDF and feed it into PDF-XChange Viewer.

The end result is the same with a separate text layer in the PDF. As others have said this works well on computers but less well on eBook readers.
Freeware Windows workflow in 2020
viewtopic.php?f=19&t=3620
ncraun
Posts: 11
Joined: 27 Jul 2013, 10:08
E-book readers owned: 1
Number of books owned: 0
Country: USA

Re: Freeware Windows workflow: 40Mb 400pg OCR'd A4 book pdfs

Post by ncraun »

If you are processing black and white pages, especially of plain text, then using a JBIG2 encoder instead of JPEG will create a much smaller final file. JBIG2 is a bitonal image compression algorithm similar to DjVu's cjb2. pdfbeads will automatically use JBIG2 encoding if your source images are black and white and you have an appropriate JBIG2 encoder installed on your system. I would recommend jbig2enc: it is free software, fast, and produces great results. You can get the source code here: https://github.com/agl/jbig2enc

I haven't tested it on Windows, because I primarily use Linux-based systems, but the source code is available so you can try compiling it yourself.

As an example of the reduction in file size, I was able to get a 576-page book with full OCR and bookmarks down to 7.6 MB (5.7 MB without OCR and bookmarks).

I am working on a write-up of my own book-scanning process; I'll post it on the forum when it is done. If you have any more questions about this, just ask.
rkomar
Posts: 98
Joined: 12 May 2013, 16:36
E-book readers owned: PRS-505, PocketBook 902, PRS-T1, PocketBook 623, PocketBook 840
Number of books owned: 3000
Country: Canada

Re: Freeware Windows workflow: 40Mb 400pg OCR'd A4 book pdfs

Post by rkomar »

ncraun wrote:If you are processing black and white pages, especially of plain text, then using a JBIG2 encoder instead of JPEG will create a much smaller final file. JBIG2 is a bitonal image compression algorithm similar to DjVu's cjb2. pdfbeads will automatically use JBIG2 encoding if your source images are black and white and you have an appropriate JBIG2 encoder installed on your system. I would recommend jbig2enc: it is free software, fast, and produces great results. You can get the source code here: https://github.com/agl/jbig2enc
I've always compressed bitonal images with CCITT Group 4 (fax) compression ("convert -compress Group4" with ImageMagick), and it seems to be pretty good. Have you compared the output file sizes from the two compression schemes? I expect JBIG2 will compress better, but I wonder by how much.

Edit: I tried it myself and got a 174-page bitonal PDF file down to 1 MB with JBIG2 compression, from 12 MB using Group 4 compression. Very impressive!
cday
Posts: 447
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Freeware Windows workflow: 40Mb 400pg OCR'd A4 book pdfs

Post by cday »

JBIG2 comes in lossless and lossy versions and the specification allows developers considerable freedom to use their ingenuity in how they implement the format, with the result that different implementations may differ slightly in their characteristics.

In the lossless form, a table of bit patterns is built up so that similar patterns only need to be stored once, saving space. As imaged font characters often have stray pixels on the edges of the character outline, and these will typically occur in different positions, many instances of each character will therefore be stored. When the file is viewed the page image will be reproduced exactly, complete with any stray pixels.

In the lossy form, similar characters are recognised so that only one instance of each character need be stored, potentially resulting in significantly greater compression. In addition stray pixels on the character outline can potentially be identified and removed in the final stored pattern, resulting in cleaner text. The viewed page image will therefore typically not be quite identical to the original image. A limitation to be aware of is that any characters that are misidentified for any reason in the encoding stage will be reproduced incorrectly as different characters. That is likely to be more of an issue with lower-resolution input images.

I think that's about right based on some study a while back. Wikipedia has an entry for JBIG2 with some useful links for more detailed understanding.
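The symbol-dictionary idea described above can be illustrated with a toy sketch in Python (this is nothing like a real JBIG2 encoder, which works on bitmaps, stores positions, and in lossy mode merges merely similar shapes; it only shows why repeated glyphs compress so well):

```python
def symbol_encode(glyphs):
    """Toy symbol-dictionary coder: store each distinct glyph once
    in a table, then represent the page as indices into that table.
    Real JBIG2 also records each glyph's position on the page, and
    its lossy mode matches visually similar (not just identical)
    bitmaps, which is where misidentification can creep in."""
    table = []
    refs = []
    for g in glyphs:
        if g not in table:
            table.append(g)
        refs.append(table.index(g))
    return table, refs
```

For a typical text page the table stays small while the reference list covers every glyph on the page, which is the source of the large savings over per-pixel schemes like Group 4.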
ncraun
Posts: 11
Joined: 27 Jul 2013, 10:08
E-book readers owned: 1
Number of books owned: 0
Country: USA

Re: Freeware Windows workflow: 40Mb 400pg OCR'd A4 book pdfs

Post by ncraun »

rkomar wrote:Have you compared the output file sizes from the two compression schemes? I expect JBIG2 will compress better, but I wonder by how much.
I haven't used G4 compression myself, but I recall the DjVu 3 specification mentions that JB2 (a compression method used in DjVu bitonal images, similar to JBIG2) creates files six times smaller than fax G4.
cday wrote: A limitation to be aware of is that any characters that are misidentified for any reason in the encoding stage will be reproduced incorrectly as different characters. That is likely to be more of an issue with lower-resolution input images.
This is a good point to bring up. When I was post-processing some low-resolution, poorly scanned documents for a friend, I noticed that some similar characters got mistakenly identified (some 'a's became 'o's, and so on). However, if you have a good-quality input scan then this is not really much of a problem. For example, with my documents at 300 and 600 dpi, there were no character misidentifications. I suppose the takeaway is to have a good-quality input scan, but as a book scanner you should know that anyway. Still, it's a good thing to keep in mind, so that if something does go wrong you know what could be causing the issue.

One thing I think would be interesting would be a JBIG2-aware OCR program that let you associate each JBIG2 table entry with a character, sort of like the old Windows SubRip program for doing OCR on DVD subtitles (which were stored as images, not text). The user could accept the automatically recognized character (produced by piping that single shape through something like Tesseract), or choose a different one. The program would then apply the selected character to every instance of that shape across the whole book, and it would know the exact position to place each character, since that is provided by the JBIG2 shape data. I don't know how well this imagined program would compare to current OCR tools; it might not be worth creating.
cday
Posts: 447
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Freeware Windows workflow: 40Mb 400pg OCR'd A4 book pdfs

Post by cday »

An interesting example of lossy JBIG2 changing numbers when documents are photocopied:

http://www.bbc.co.uk/news/technology-23588202

In this case it was not just a word with an incorrect letter, but numbers altered where the consequences could have serious implications: an inappropriate use of the format by a major company.

The same issue could arise with Adobe ClearScan, which is also dependent on correct recognition of characters, if it were used in a sensitive situation.
victoriaaustralia
Posts: 55
Joined: 07 Nov 2011, 16:22
E-book readers owned: newton
Number of books owned: 2
Country: Australia
Location: Castlemaine, Victoria, Australia

Re: Freeware Windows workflow: 40Mb 400pg OCR'd A4 book pdfs

Post by victoriaaustralia »

victoriaaustralia wrote: Homer for me does not OCR, but it does easily create lovely small, good-quality PDF files without having to struggle with pdfbeads, Ruby or gem environments.
This PDF file is then OCR'd using PDF-XChange Viewer, which creates the text as a separate layer. For my 400-page book this added only 2 MB.
If any rearranging of the PDF is needed, the free version of PDF-XChange Viewer does not do this; I use PDF SAM (PDF Split And Merge) if pages need shuffling or deleting.
Well, I just formatted the hard drive, re-installed Windows Vista, the other programs, and then Homer, and now the OCR portion works fine! I must have done something to interfere with Tesseract on my prior install of Homer. So there is now no need for the PDF-XChange Viewer step. Homer has just compressed the TIFFs from ScanTailor (613 pages, 1.06 GB) down to a 30.9 MB OCR'd PDF with a mixture of grayscale and black-and-white images.
Freeware Windows workflow in 2020
viewtopic.php?f=19&t=3620
victoriaaustralia
Posts: 55
Joined: 07 Nov 2011, 16:22
E-book readers owned: newton
Number of books owned: 2
Country: Australia
Location: Castlemaine, Victoria, Australia

Re: Freeware Windows workflow: 40Mb 400pg OCR'd A4 book pdfs

Post by victoriaaustralia »

As per this thread:
http://www.diybookscanner.org/forum/vie ... 906#p16906

I was having what appeared to be an intermittent failure of Homer to run the OCR/Tesseract component but not the compression/pdfbeads component. It turns out the file names were too long: files seem to need to be in the format 0001.tiff, 0002.tiff, etc., and no longer.
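A quick way to catch that problem before running Homer is to scan the folder for names that stray from the strict pattern. This is a hypothetical helper, assuming the NNNN.tiff naming convention described above:

```python
import re
from pathlib import Path

def nonconforming_names(folder):
    """Return file names in `folder` that do not match the strict
    four-digit NNNN.tiff pattern that Homer appears to require."""
    pattern = re.compile(r"^\d{4}\.tiff$")
    return [p.name for p in sorted(Path(folder).iterdir())
            if not pattern.match(p.name)]
```

Run it on ScanTailor's out folder; an empty list means the names are safe, and anything it returns (including stray cache files) should be renamed or removed first.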
Freeware Windows workflow in 2020
viewtopic.php?f=19&t=3620
Post Reply