DIY Book Scanner

Posted: **03 Aug 2014, 09:20**

I am sorry to ask what might be buried in here... could anybody help me with some post-processing tweaks? After I finish with ST, I bind with Adobe Acrobat XI and run recognition, mostly in French, which catches the occasional French letters and accents that turn up in English text (the word fiancée comes to mind).

The results are great - but large. Often, a 300-page book is 20-30 megabytes, and often 50 or more if it has many pictures in it (I am an academic historian, and many good history books have several pages of photographs either in the middle or scattered throughout the text). I have seen people magically produce pdfs of the same (DAMNED) book I just did at 1/10th the size. Not only do I want to strangle them for having done the same book I just did, but I also want to find the original person who did it and waterboard them until they give up the secret of making these pdfs so small.

Can anyone give me good tweaks here?

Thank you!

Posted: **04 Aug 2014, 21:09**

Are any of your files shareable? It's a lot easier to offer suggestions if there's something to look at.

Posted: **05 Aug 2014, 04:21**

I really recommend you take a look at some of the free software available.
My preferred stack is ScanTailor to postprocess and clean up the images, then I run the resulting TIFs through Tesseract for OCR. In the last step, pdfbeads separates illustrations from text and encodes the former with JPEG and the latter with the highly efficient JBIG2 algorithm. These layers are then bundled with the OCR data from the previous step and written to the resulting PDF.
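A minimal sketch of how the last two steps could be scripted, assuming `tesseract` and `pdfbeads` are already installed and on the PATH. The function names are hypothetical, and the sketch only relies on two things: Tesseract's `hocr` output config, and pdfbeads writing the finished PDF to standard output. Depending on the versions involved, the hOCR file extension Tesseract writes may not be the one pdfbeads looks for, so check that before trusting the text layer.

```python
import os
import subprocess


def build_ocr_commands(pages, lang="fra"):
    """Return one Tesseract command per page image, in sorted page order."""
    commands = []
    for page in sorted(pages):
        base, _ext = os.path.splitext(page)
        # tesseract <image> <output base> -l <language> hocr
        commands.append(["tesseract", page, base, "-l", lang, "hocr"])
    return commands


def build_pdfbeads_command(pages):
    """Return the pdfbeads command; the PDF itself is written to stdout."""
    return ["pdfbeads"] + sorted(pages)


def run_pipeline(pages, output_pdf, lang="fra"):
    """Run OCR on every page, then bundle everything into one PDF.

    Assumption: this is run from the folder holding the ScanTailor output.
    """
    for cmd in build_ocr_commands(pages, lang):
        subprocess.run(cmd, check=True)
    with open(output_pdf, "wb") as out:
        subprocess.run(build_pdfbeads_command(pages), stdout=out, check=True)
```

Calling `run_pipeline(["0001.tif", "0002.tif"], "book.pdf")` from inside the ScanTailor `out` folder would be the intended use; the file names here are placeholders.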

Posted: **05 Aug 2014, 06:59**

Daniel, I am including a copy of a finished-product book. I have more, but this illustrates the problem. These are already recognized.

Unlike ones that I found now, I typically put the front cover, back cover, and then the flaps in the scan, and THEN the text, as I have a tendency to read just that, so why not have the back cover on page two and the flaps on page 3 and 4, especially as this is not intended for public distribution.

Still, I am shocked to find that some of these small ones are not even recognized.

I forgot to mention that I convert these books from PDF to djvu format, but PDF2DJVU has become extremely unreliable.

I selected a book as an example. I think that it uploaded but am not sure. It was 40 megabytes, and I had to wait quite some time for it to upload. We'll see.

Posted: **05 Aug 2014, 07:57**

jbaiter, I would like to try pdfbeads and tesseract. I have not used any CLI programs at all (ever) for this process, and therefore the amount of things that I would have to be shown is very large. For instance, I just downloaded pdfbeads, and it is in .gem format. Go on and snicker, but my first reaction was, "What the Hell is THAT? How do I unzip that even? How do I install and use it?"

Posted: **05 Aug 2014, 10:09**

As you already use Acrobat, try to fiddle with the "Reduced Size PDF" and "Optimized PDF" commands in the File > Save As Other... submenu.

Books containing many images can become large in size. Since hard drive space is very, very inexpensive nowadays, file size doesn't matter much for many use cases.

Posted: **05 Aug 2014, 12:05**

jaske78 wrote: I selected a book as an example. I think that it uploaded but am not sure. It was 40 megabytes, and I had to wait quite some time for it to upload. We'll see.

The file wasn't uploaded successfully, as it doesn't show in your post: almost certainly because it exceeds the attachment size limit for the forum, which is only a few MB (I'm not sure of the exact value). I would have thought you should have received an error message, but that didn't happen when I tried to upload a file beyond the limit in a quick test.

You could in principle upload the file to the cloud and include a link to that file in a post.

If you want to upload something useful as an attachment, you could possibly use Acrobat to create a small PDF file consisting of just a few selected pages from the full file, chosen to represent the various types of content, for example a page consisting only of text, and pages showing the various types of illustration, black and white and colour. Even better, you could possibly do the same for the same pages in the other, much smaller, version of the PDF and upload that too...

jaske78 wrote: I just downloaded pdfbeads, and it is in .gem format. Go on and snicker, but my first reaction was, "What the Hell is THAT?"

Google is very good for providing insight into computer topics; I've no idea what it is either, and also have very limited command line experience, but my first reaction is usually to Google for information.

There's a lot that could be written in general terms about dpi, colour modes, image compression options and so on, and the optimum way to format pages of different types to obtain good quality images at minimum file size.

dtic wrote: As you already use Acrobat, try to fiddle with the "Reduced Size PDF" and "Optimized PDF" commands in the File > Save As Other... submenu.

That seems good advice for at least a 'quick fix' and should be reasonably easy to try, although it probably wouldn't provide deeper insight...

No-one yet has mentioned the possibility that the other, much smaller, PDF file might use Adobe ClearScan vector text, which not only produces very high quality text but also dramatically reduces the file size of a book consisting mainly of text. That should be easily determined if you can upload some sample pages; I think pages extracted from the full file should maintain their basic properties. It could also be determined by zooming in closely and examining the text to see if the quality is maintained.

But there are caveats about using ClearScan as it requires good image resolution to work well, and more particularly, it can introduce errors into the text by sometimes misidentifying characters in the scanned image and then inserting an incorrect character into the text, which as far as I know can't be corrected: possibly not ideal for a text book...

Posted: **09 Aug 2014, 09:01**

I have a dropbox of things for a class I taught last semester, as the University library was not particularly useful for Colonial Latin American history (nor really of any history) and I have learned a sad and painful lesson about lending my books to students.

Here is the link: you are welcome to look over the books there to get a feel for it.

https://www.dropbox.com/sh/ecsu4fkorfn1 ... fzy7IUwPYa

Learning Ruby looks like it will take some time. I would really like to export my PDFs to Tiffs and do something so that they are smaller.

Although you are right that there is cheap storage available, those limits will soon be reached at this rate. I have 5 bookshelves, doublestacked, plus often scan library books in my field that are difficult-to-find, expensive, or both for research purposes. The upward end of my collection is enormous. Already, I have over 500 books that I scanned by hand ON A SCANNER, and that comes to over 20 gigabytes. I need to be careful that I am making practical decisions about space.

Posted: **09 Aug 2014, 17:19**

Well, I've had a look at two of your files, Fisher and Bristol...

I wanted to open the pdf files in a text editor to inspect them (not that I have more than a very small sub-set of knowledge of pdf) but the file sizes made that fairly impractical. I then decided to extract some representative pages to inspect the coding, but as I don't have a modern version of Adobe Acrobat, I had to use other software and assume that the coding of the extracted pages was unaltered, which I think is likely.

Inspecting the code in the extracted pages, I was able to determine that the large number of pages consisting just of text were compressed using JBIG2, which is the most efficient compression method for black and white images supported in the pdf standard. However, I'm not entirely sure whether Acrobat uses lossless encoding, lossy encoding, or offers a choice. I have the user's manual but am not sure if I'd be any the wiser if I looked in it.

The large number of text pages with colour highlighting were efficiently handled, with a relatively small increase in file size due to the colour on the page, which is interesting and shows the power of the pdf format.

The small number of colour images in the above books were encoded using jpeg compression in my extracted pages, but that may be due to a limitation in the software I used which doesn't support jpeg2000 (j2k) compression: that would produce slightly smaller images, but the overall effect on the total file size would be small unless a book had many illustrations.

So, it looks as if your existing files are already about optimally compressed for the content they contain as far as I can see, the only question being whether JPEG2000 compression of colour images is enabled.

To reduce the filesize further, one option would be to downsample the images to a lower resolution, remembering that the text pages are images too, if the quality loss is acceptable. Another would be to test the Adobe Clearscan option, which could produce a substantial filesize reduction while actually improving on the already quite acceptable text quality.
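As a rough illustration of what downsampling buys, here is a small sketch of the arithmetic. The function names and the 6 x 9 inch page are assumptions for the example only, and the compressed file size will not shrink in exact proportion to the pixel count, since that depends on the content and the compression method.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def downsample_pixel_ratio(original_dpi, target_dpi):
    """Fraction of the original pixel count left after downsampling."""
    if original_dpi <= 0 or target_dpi <= 0:
        raise ValueError("dpi values must be positive")
    # Resolution changes in both directions, so the pixel count
    # scales with the square of the dpi ratio.
    return (target_dpi / original_dpi) ** 2


# A hypothetical 6 x 9 inch page, before and after halving the resolution.
full = page_pixels(6, 9, 600)
half = page_pixels(6, 9, 300)
ratio = downsample_pixel_ratio(600, 300)
```

Halving the resolution leaves a quarter of the pixels, which is why even a modest reduction in dpi can have a large effect, at the cost of the quality loss mentioned above.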

The above assumes that Acrobat doesn't use JBIG2 lossy, which I think is likely: if that option is available but not selected, it should certainly be tried, remembering that like ClearScan it is a lossy process that could in principle replace a misidentified character by a perfect rendering of another character. But the original scans are generally of good quality, except for some curvature of a few text lines, and the nature of the books could probably tolerate rare errors. In the worst case a name or date could possibly be incorrect, though.

Thinking about the potential reduction in filesize that JBIG2 lossy might bring, I've run some quick tests using Abbyy FineReader 12 which has the option. I basically opened one of your pdf files, ran the OCR process to give searchability, and saved the result as a pdf file using a number of different settings. I obtained some reduction in filesize, although not quite as great as I expected, possibly in part due to the file size of the photograph images which remained unchanged. In passing, FR12 produces more accurate OCR than Acrobat, although as your images are of reasonable quality the difference may not be great.

The original Fisher... pdf was 14.2MB and I hoped to at least halve the size, but to do that I had to use almost the maximum JBIG2 lossy setting. At a quick look the text still looks good, but you should inspect it very carefully if it is of interest, looking particularly at smaller characters such as superscripts, accents and the text where the baseline is curved. Incidentally, I OCR’ed with Spanish selected having read your comment above concerning correct recognition of accents.

Attachment: Fisher FR12_JBIG2_Lossy_Q10.pdf (5.77 MiB)

If Adobe ClearScan doesn’t work for you, bearing in mind the possibility as with JBIG2 lossy that a misrecognised character could be replaced by a perfect copy of another character, you could try using FineReader for any future scans, or for creating a searchable pdf file from camera images. You would have to spend some time learning a new user interface and finding the optimum settings to use, though. Independent tests have shown that FineReader produces more accurate OCR results than Acrobat, although the difference may not be great for your scans as the images are good quality. If you start using a camera it could be much more of a consideration.

dtic wrote: Since hard drive space is very, very inexpensive nowadays, file size doesn't matter much for many use cases.

I tend to agree with dtic that with 1TB and 2TB drives currently readily available, and looking to the future, the size of the 20GB of files you have now shouldn’t really be a serious concern...

Posted: **02 Sep 2014, 14:41**

I’ve now seen Adobe ClearScan in operation for the first time and I have certainly been impressed with what I have seen. I ran the Fisher... PDF file used in my Abbyy FineReader tests in the previous post through Acrobat XI and the file size was reduced from 13.87MB to 3.76MB. The output quality looks very good in the sample pages I’ve examined, although it hasn’t of course been practical to check for all possible errors or defects in the book.
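For anyone who wants to compare results on their own files, the reduction can be worked out as below. This is just the arithmetic on the figures quoted in this thread; the helper name is made up, and the FineReader comparison assumes that the 14.2 and 5.77 figures are in the same units, which the forum's MB/MiB labels do not guarantee.

```python
def percent_reduction(size_before, size_after):
    """Percentage by which a file shrank; both sizes in the same unit."""
    if size_before <= 0:
        raise ValueError("size_before must be positive")
    return 100.0 * (size_before - size_after) / size_before


# ClearScan in Acrobat XI: 13.87 -> 3.76
clearscan = round(percent_reduction(13.87, 3.76), 1)

# FineReader 12 with JBIG2 lossy, from the earlier post: 14.2 -> 5.77
finereader = round(percent_reduction(14.2, 5.77), 1)
```

On those figures ClearScan removes roughly 73% of the file and the FineReader JBIG2 lossy setting roughly 59%.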

Attachment: Fisher... __ClearScan.pdf (3.76 MiB)

Image: CS_1.png (82.63 KiB)

In the above image the upper text is from the original scan file and the lower text is from the resulting ClearScan file: notice that the curved baseline at the left side of the original text has been corrected in the ClearScan text. That seems an unexpected and rather impressive enhancement.

Image: CS_2.png (96.29 KiB)

In the above text the baseline curvature is slightly greater but look carefully at the resulting ClearScan text: the upper three lines have been straightened but the left-most side of the lower two lines is still curved. That’s interesting...

Image: CS_3.png (20.76 KiB)

In the upper image note that the stray pixels on the outline of the two a’s and two r’s are in different places, as is normal in a scanned image at moderate resolution. The lower image shows the resulting ClearScan text and illustrates the dramatic improvement in visual quality that is one of ClearScan’s advantages. ClearScan synthesises scalable vector fonts to match the original text, so that quality will be maintained when zooming in, just as with text in a word processor document.

In the lower image note that the two a’s and the two r’s look identical, so it is reasonable to assume that ClearScan has identified them both as being the same character. Fine, but that inevitably means that ClearScan, like JBIG2 Lossy in the previous post, could in principle display an incorrect character if a character in the scan is misidentified, for example if it is poorly formed.
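To make the risk concrete, here is a toy sketch of the general idea of matching each scanned character against a shared set of representative shapes and reusing the representative when the match is close enough. It is not Adobe's algorithm, which is undisclosed, nor a real JBIG2 encoder; the 3 x 5 bitmaps, the pixel-difference measure and the threshold are all invented for the illustration.

```python
def pixel_difference(a, b):
    """Number of pixels that differ between two equal-sized bitmaps."""
    return sum(x != y for x, y in zip(a, b))


def encode_with_shared_shapes(glyphs, max_diff):
    """Replace each glyph by the index of a close-enough representative."""
    shapes = []   # representative bitmaps stored once
    indices = []  # which representative is displayed for each glyph
    for glyph in glyphs:
        for i, shape in enumerate(shapes):
            if len(shape) == len(glyph) and pixel_difference(shape, glyph) <= max_diff:
                indices.append(i)
                break
        else:
            # Nothing similar enough yet: store this glyph as a new shape.
            shapes.append(glyph)
            indices.append(len(shapes) - 1)
    return shapes, indices


# 3 x 5 bitmaps written row by row, 1 = ink.
LETTER_C = "111" "100" "100" "100" "111"
NOISY_C = "111" "100" "100" "100" "110"   # one stray pixel
LETTER_E = "111" "100" "111" "100" "111"  # differs from c by two pixels

careful = encode_with_shared_shapes([LETTER_C, NOISY_C, LETTER_E], max_diff=1)
aggressive = encode_with_shared_shapes([LETTER_C, NOISY_C, LETTER_E], max_diff=2)
```

With the stricter threshold the two c's share one shape and the e keeps its own. With the looser threshold the e is also displayed as a c: a clean-looking but wrong character, which is exactly the failure mode to watch for in names and dates.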

When Adobe introduced ClearScan in Acrobat 9 it seemed a massive step forward, offering the possibility of replacing a large file consisting of scanned images that would not scale well with pages of smooth, scalable vector text, and at the same time greatly reducing file size.

In its early days Adobe was a prominent font company, and ClearScan is a proprietary technology that exploits that background. Adobe has said very little about the technology, basically only revealing that it replaces the characters in a scan with synthesised Type 3 fonts that closely approximate fonts in the original scan, and preserves the page background [when there is a non-white background] using a low-resolution copy. Adobe recommends scanning at a high resolution, ideally 600 DPI, for the best results.

When ClearScan was introduced I assumed, possibly incorrectly, that the text displayed so elegantly on the page was the output from the normal OCR process. As Acrobat’s OCR results at the time on any except good quality scans were often less than perfect, I assumed that it could easily introduce errors into the displayed text when used on lower-quality images, for example from a camera.

Acrobat X reportedly introduced a greatly improved version of ClearScan. Looking at the above sample images, it looks as if it uses a combination of methods, displaying the OCR output text when it has confidence in the output, and a synthesised vector image of the text when it is less confident, as in some of the curved text above. On close inspection the reproduction of the page image even extends to the reproduction of black specks from dust in the scanned image, so it looks as if the displayed page can normally be taken as an accurate reproduction of the original.

ClearScan has been criticised in some earlier posts for producing a very large number of fonts which inevitably increase the file size, but this is probably mainly a problem with lower-resolution page images. Rather than the fonts being duplicates because it has failed to recognise portions of text as being in the same font, I suspect that every time it finds a bit pattern it needs to vectorise in order to produce a good quality output, it simply adds it to the current font until it is full, and then starts a new font.
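The "fill a font, then start a new one" guess can be sketched as follows. This only illustrates the suspicion in the paragraph above, not documented ClearScan behaviour. The capacity of 256 glyphs per font is an assumption, based on simple PDF fonts addressing their glyphs with single-byte codes, and the function name is made up.

```python
def fill_fonts_in_order(glyph_count, glyphs_per_font=256):
    """Return how many glyphs land in each font when fonts are filled in turn."""
    if glyph_count < 0 or glyphs_per_font <= 0:
        raise ValueError("glyph_count must be >= 0 and glyphs_per_font > 0")
    full_fonts, remainder = divmod(glyph_count, glyphs_per_font)
    fonts = [glyphs_per_font] * full_fonts
    if remainder:
        # The last font is only partly filled.
        fonts.append(remainder)
    return fonts


# A hypothetical book needing 600 distinct vectorised shapes.
fonts_needed = fill_fonts_in_order(600)
```

Under this guess the number of fonts depends only on how many distinct shapes had to be vectorised, which would explain why noisier, lower-resolution scans end up with more fonts.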

The OCR results seem to be generally very good on these high-quality scan pages, with even words that are displayed as vector images often recognised correctly, and therefore searchable. The OCR results are not, however, entirely free of errors, with some difficult characters missed and some Spanish accents not reproduced, although I did inadvertently run ClearScan with the English language selected, rather than Spanish, or English and Spanish if that is selectable in Acrobat.

So ClearScan does have a lot to offer and I’m duly impressed. With good-quality images it should suit most users, but it is as well to keep in mind that it does have at least a theoretical possibility of displaying an incorrect character in adverse conditions. That would probably usually not be an issue, but in an academic work in the worst case, a name or date in the displayed text could theoretically be changed.

At the end of the day it is necessary to balance the benefits of greatly improved text quality and a greatly reduced file size against a possible need to spend more time checking the output produced. A simple page image scan compressed using a lossless method can be assumed to be accurate, and can always be referred to in the event of a query later.

Edit:

An incidental advantage of the reduced file size of a ClearScan file is that pages are displayed more quickly when stepping through the pages of a long document on a slower computer, but like the file size reduction, that should rapidly become less significant if computing technology continues to progress at the present pace.

Post-processing tweaks needed!