DRAFT MUG's GUIDE: Converting old books to PDF

b0bcat
Posts: 49
Joined: 30 Nov 2012, 21:37
Number of books owned: 0
Country: UK

Post by b0bcat »

This is for anyone wanting to digitise old books in volume. It is only a draft so far, and it occurred to me to ask for comments here first, both to weed out any obvious errors and in case anyone has a proven solution to the problem in #6, which I tentatively think may be fixed by using the alternative image-to-PDF application noted below:

-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0

SUBJECT: How to make sensibly sized PDFs (with searchable text layer) of old books using free software.

The objective of this note is to provide an outline guide on how to produce reliable digitisations of old books or other documents without using a proprietary optical character recognition (ocr) software package. All the applications are for MS Windows, but I think some or all of them have Mac and *nix versions or analogues.

PDF documents come in various flavours. I haven't examined the formal specs, but I've worked with or created pdfs which comprise:

-- images only of the paper pages of a book or document;

-- images of the paper pages with a searchable/copiable text layer hidden underneath;

-- images of the paper pages with a searchable/copiable text layer on top, either only where the ocr program found the characters indistinct or uniformly for all text irrespective of recognisability;

-- text (and graphics among the text, if any) only, without images of the paper pages; makes for much smaller pdf files, but requires a lot of careful work to get the right result.

There are lots of good (mainly pdf) scans of books, documents etc out there but many times the result could have been better. For instance, some pdfs are so large they cannot be read on portable ereaders successfully or at all. Or they may just be images of pages without searchable text; or text-only PDFs which have not been diligently proofed so they omit material, do not reflect original formatting and so on.

Conversely, making a good scanned document that contains only text and pictures (if any) rather than images of the pages themselves can be tricky and has a fairly steep learning curve. Also the results are rarely perfect; words get missed out or munged. You need to spend time at it, particularly with proofing, and some might say that can be accomplished only by one person reading the paper original text aloud while a second person follows the written form of the ocr copy (or by using TTS readers or some other workaround).

So instead of all that, here is an alternative that works for me:

1. Scan books and other similar documents that are mainly text and not images (or not colour images) at 300dpi. I have found that anything higher generally confuses ocr programs and results in stupidly big files. Don't scan text in colour. And don't usually scan text in grayscale, I've never found it increases ocr efficiency or accuracy.
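
To put rough numbers on the file-size point, here is a minimal sketch of the arithmetic for one uncompressed page. The 5.5 x 8.5 inch page size is only an assumed example (it is not from the guide), the function name is made up for illustration, and real tiff or pdf files will be smaller once compressed.

```python
# Rough, uncompressed size of one scanned page at a given resolution.
# The page dimensions used below (5.5 x 8.5 inches) are an assumed example.

def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the uncompressed image size in bytes for one page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Bits per pixel for the three usual scanner modes.
MODES = {"b&w (1-bit)": 1, "greyscale (8-bit)": 8, "colour (24-bit)": 24}

if __name__ == "__main__":
    for dpi in (300, 600):
        for name, bits in MODES.items():
            size_mb = raw_page_bytes(5.5, 8.5, dpi, bits) / 1_000_000
            print(f"{dpi} dpi, {name}: {size_mb:.1f} MB per page before compression")
```

Doubling the resolution quadruples the pixel count, so a 600dpi page is four times the raw size of the same page at 300dpi, whatever the colour mode.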

2. Make sure your scanning application allows you to save scans of each page of the book or other item as .tiff or (maybe) jpeg. In other words an application that discards the images of the pages and does auto-ocr resulting in your getting only editable text is a waste of time for these purposes.

3. Clean up the image files using ScanTailor from Sourceforge.net, see also pages about it at http://www.diybookscanner.org/forum/index.php . By clean up I mean rendering the images better for more accurate ocr and smaller file size by for example:

- making all pages a uniform size;
- correcting horizontal / vertical misalignment;
- removing noise such as black margins, handwritten notes etc;
- splitting pages scanned from an open book, where two pages sit side by side, into two halves;
- despeckling;
- dewarping text;
- reducing the size of pages that have mixed text and pictorial content by applying the application's mixed mode, so that only the picture areas are rendered in grey or colour as applicable and the text is rendered in b&w only.

-- and so on. Make sure you change the output resolution from the default 600dpi to 300dpi and, unless the text is very spotted, turn off the despeckling, since it sometimes turns commas into periods.

Basic tutorial http://vimeo.com/12524529

4. Assemble the tiff files output by ScanTailor into a pdf file using a free utility like the ImPDF library
http://www.comsquare.ch/files/downloads/ImPDF_0_90.zip
This contains a .dll which is packaged as one of the plugins for the graphics program Irfanview http://www.irfanview.com/
Note that the version packaged with the plugins at http://www.irfanview.com/plugins.htm is out of date, so preferably get it from the comsquare website and install it to the plugins folder of Irfanview.
In Irfanview, menu: Options / Multipage Images / Create Multipage PDF (Plugin)
The plugin seems to have a minor bug in as much as it won't allow you to select more than ~200 image files at once. So just select <200 and then repeat the add images step as necessary.
[NOTE: problem / downside remarked on in #6 below concerning legibility on portable ereaders of pdfs made with this application; at time of posting, testing still under way, but a better alternative may be to use the free Convert Images to PDF tool http://www.pdfill.com/pdf_tools_free.html ]
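
If you want to plan the "select fewer than 200, then repeat" step before opening the plugin dialog, a small sketch like the following splits a book's page files into consecutive batches. The function name, the batch size of 199 and the zero-padded file names are all assumptions for illustration; the plugin's exact limit is only known to be roughly 200.

```python
def batches(paths, max_per_batch=199):
    """Split file names into consecutive groups of at most max_per_batch.

    Names are sorted first, so zero-padded names (page_0001.tif, ...)
    stay in page order across batches.
    """
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be at least 1")
    ordered = sorted(paths)
    return [ordered[i:i + max_per_batch]
            for i in range(0, len(ordered), max_per_batch)]


if __name__ == "__main__":
    # Hypothetical 450-page book.
    pages = [f"page_{n:04d}.tif" for n in range(1, 451)]
    for number, group in enumerate(batches(pages), start=1):
        print(f"batch {number}: {group[0]} to {group[-1]} ({len(group)} files)")
```

Each batch can then be added in turn in the plugin dialog, in the order printed.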

5. What you will then have is a PDF comprising images of the pages of the book/document. But it won't be searchable or copyable as text since there is no text layer. Generally you need an OCR program for that, and specifically, one that can output in pdf format with text alone or text under images. But if you install the free PDF-XChange PDF Viewer from http://www.tracker-software.com/ you will find it has a rudimentary ocr function that will ocr an existing image-pdf and add a text layer under the page image. I've experimented using it in high accuracy mode and the results have been good provided the input image files are tidied up well (step #3 above). They won't be perfect since you cannot review or change the text layer but it's better than nothing. And you have the image of the page itself so if you need to copy a part and its text layer has an error in it you can edit that manually in your Windows clipboard / other app's copy of the text.

6. The only downside I've found to this so far is that while the resulting pdf files can be read fine on a Windows platform, using Acrobat or PDF-XChange Viewer for instance, they come up as blank pages on certain portable reading devices. This seems to be a common problem with pdfs from many different sources and may be due to the device's pdf reader conforming to an older standard. For instance, I've read posts that say you can fix this error by producing the pdf without jpeg2000 compression. [Please post a reply if you have tested any free application that can fix this. I've tried writing the pdf via a convert-to-pdf print driver like cutepdf or PDFCreator, but they strip out the text layer (they only print the layer that is visible on screen, i.e. the image of the paper page) and even then the output still came up blank on the portable reader. Running cat and other steps in pdftk also didn't seem to have any effect.
Potential fix: still testing, but combining the page images into a pdf file in stage #4 above with the "Convert Images to PDF" tool in http://www.pdfill.com/pdf_tools_free.html may be a fix. I have tested it using one file which I then ocr'd as per step #5 and iirc it worked, although several files were produced for testing and I may have lost track! So if you know of any working solutions, please post.]
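
One way to check whether a given pdf actually uses jpeg2000 is to look for the image filter names in the file. In the PDF format, JPEG2000 images are marked with the filter name /JPXDecode, and JBIG2, ordinary JPEG and fax-style b&w images with /JBIG2Decode, /DCTDecode and /CCITTFaxDecode. The sketch below simply counts those names in the raw bytes. It is a rough diagnostic only (it does not parse the pdf properly, and the function name is made up), and it cannot say whether jpeg2000 really is the cause of the blank pages.

```python
import re
import sys

# Image compression filter names as they appear in PDF stream dictionaries.
FILTERS = (b"JPXDecode", b"JBIG2Decode", b"DCTDecode", b"CCITTFaxDecode")


def count_filters(pdf_bytes):
    """Count occurrences of each image filter name in raw pdf bytes."""
    counts = {}
    for name in FILTERS:
        # Require the name to end here, so a longer name does not match.
        pattern = re.compile(rb"/" + re.escape(name) + rb"(?![A-Za-z0-9])")
        counts[name.decode("ascii")] = len(pattern.findall(pdf_bytes))
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_filters.py book.pdf
    with open(sys.argv[1], "rb") as handle:
        for filter_name, total in count_filters(handle.read()).items():
            print(f"{filter_name}: {total}")
```

A non-zero JPXDecode count would suggest the pdf contains jpeg2000 images, which is the case the posts mentioned above point to.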
There are free ocr programs out there, such as those listed below, but as far as I know none will make a pdf of page images with text underneath. Most produce just ocr text, and that with results of variable accuracy.

1) Cuneiform EN V12 (Windows)
http://cognitiveforms.ru/downloads/setu ... orm_en.exe
Generally good recognition of page layout, paragraphs etc. Clunky, in that e.g. the install routine didn't produce properly named Start menu shortcuts and it doesn't seem to allow adding multiple image files in a manual operation, only in batch. Will not output as pdf, so no text-under-image pdf for example.
Nevertheless, in my view it is probably the best, and the one free ocr program that should have development effort concentrated on it rather than on independent projects producing less than mediocre results.

2) SimpleOCR (Windows)
http://www.simpleocr.com/Download.asp
InstSocr.exe

3) FreeOCR v.4.2 (Windows)
http://www.paperfile.net/
http://www.paperfile.net/freeocr.exe
Tesseract (v3.01) OCR engine. It includes a Windows installer and gui.

4) Tesseract
http://code.google.com/p/tesseract-ocr/
http://sourceforge.net/projects/vietocr/ (Example of a front-end). Very poor for recognition of paragraph layout in my experience.

5) KADMOS plugin for Irfanview
http://www.heise.de/download/kadmos-icrocr-sdk.html -- ocr plugin for Irfanview, outputs plain text, not greatly accurate.

6) OCRopus(tm) open source document analysis and OCR system
http://code.google.com/p/ocropus/


From my experience you will waste much time using these for anything but the simplest of tasks.

-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0


EDIT: lol! It's always the way, but having read here a few times over the months and then cobbled the above together, I find HOMER
http://bookscanner.pbworks.com/w/page/4 ... /FrontPage
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Post by daniel_reetz »

Homer is truly fantastic, if you could post your experiences with it that would be GREAT.

Post by b0bcat »

1. Homer - this is odd, getting a presumably false positive with
http://www.microsoft.com/security/scann ... fault.aspx
Microsoft Safety Scanner 1.0.3001.0
which complained that cmdow.exe (homer v1.0beta1) was HackTool:Win32/Cmdow.A

2. just received a pdf apparently recoded with JBIG2 in Acrobat Pro and it works fine on my Sony Reader PRS-650 so when time permits I must try the pdfbeads pdf creation routine. The pdfs created with the Irfanview plugin uniformly failed to work on the Sony reader.

3. I am informed that the free OCR included with the free PDF-XChange Viewer "will be surpassed in functionality with the release of a planned licensed OCR plugin for the new PDF-XChange Editor later this year after the Editor is released in March".

Post by b0bcat »

gImageReader 3.2.3 for MS Windows and Linux

https://sourceforge.net/projects/gimage ... =directory
https://github.com/manisandro/gImageReader/wiki

An interesting development in gImageReader 3.2.2 (Jun 30 2017) (current version now at v.3.2.3):
"* Attempt to use original source image for PDF output"
This means you can feed it an image (non-ocr, non-text) pdf and it will perform ocr using tesseract to produce a searchable text layer, outputting the input pdf with said layer incorporated. I tried it with a single page pdf and it worked, though the word spacing in parts seemed to indicate a need for further work. Good enough for word search anyway and in any case the program already permits output alternatively as text only.
So this is another free MS Windows program for adding an ocr layer, alongside the closed-source PDF-XChange Viewer and PDF-XChange Editor:
https://www.tracker-software.com/produc ... nge-viewer
https://www.tracker-software.com/produc ... nge-editor
blauer
Posts: 11
Joined: 11 Mar 2015, 09:53
Number of books owned: 568
Country: Austria

Converting old books to PDF

Post by blauer »

I am working on a Mac, and up to step 4 everything works just the same on macOS for me.
For creating the PDF, though, I find it nicest to use the preinstalled Preview app. I follow this procedure.

Create .tiff files as b0bcat described, then for step 4:

a. Select all the single-page .tiff files in Finder, right click --> open in Preview.
b. Select all the opened pictures in the bar on the left (select one by clicking --> cmd + a)
c. File --> Print
choose paper size --> custom size --> put in book size
choose: "Adjust Size:" --> "Print whole Picture"
d. In the lower left corner choose "send PDF to iBooks"
e. In iBooks choose "List" and put in the metadata
f. Right click --> open with your OCR program

I am using a friend's Adobe Acrobat Pro license and its ClearScan OCR to do so. I had some trouble with ClearScan output on older devices, whereas "Searchable Picture" mode worked fine in all cases.
I know Adobe Acrobat Pro is not a cheap program, but if you get a chance to use it, do so.

Does anyone know a good free program for adding an OCR layer on a Mac?
The documents I get from this procedure perform very well on a pretty old Kindle and a Tolino Shine.

And the reason I actually answered in the first place: I tried gImageReader and opened the resulting PDF with the Mac's Preview. The text turned into complete nonsense. It turned out that under the latest macOS Sierra, Preview messes up many OCR'd PDFs!
So be careful when opening OCR'd PDFs with Preview on a Mac; Preview may mess them up!

Greetings from Austria,
David
http://www.archivar.net - archivar is a german diy-scanner based on archivist
zbgns
Posts: 61
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Post by zbgns »

It is worth mentioning that Tesseract can produce pdfs with a text layer on its own. It handles multipage tiffs, so after creating a multipage tiff you can generate a pdf with both image and text layers using a command like this:

Code:

tesseract input.tif output -l eng pdf

gImageReader is the best Tesseract GUI frontend I know for both Linux and Windows, but the above-mentioned procedure is usually much less troublesome for creating a searchable pdf (provided that you have no command line phobia).
Adobe Acrobat Pro can also do the job, but Tesseract's OCR accuracy is usually better, especially if the quality of the input images is not perfect.
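
For anyone who would rather script this than type it for every book, here is a minimal sketch that wraps the same command. The function names are made up for illustration, and it assumes a tesseract executable is installed and on the PATH; the file names are placeholders.

```python
import shutil
import subprocess


def tesseract_pdf_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG pdf"""
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]


def make_searchable_pdf(image_path, output_base, lang="eng"):
    """Run Tesseract on a (multipage) tiff and return the pdf file name."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on the PATH")
    subprocess.run(tesseract_pdf_command(image_path, output_base, lang), check=True)
    # Tesseract adds the .pdf extension to the output base name itself.
    return output_base + ".pdf"


if __name__ == "__main__":
    # Only prints the command; call make_searchable_pdf() to actually run it.
    print(" ".join(tesseract_pdf_command("input.tif", "output")))
```

Calling make_searchable_pdf("input.tif", "output") should leave output.pdf next to the input, with the page images on top and the text layer underneath.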

Post by b0bcat »

Well, time passes and 'solvitur ambulando', one learns by doing...

Glancing at this (draft) Guide of mine from back in early 2013, two material points come to mind from the experience I have gained in imaging a couple of decayed old books each year since then:

1) I recant! I recant! "#1. And don't usually scan text in grayscale, I've never found it increases ocr efficiency or accuracy." That no longer represents my thinking, especially for foxed, faded old books, for the reasons indicated in other threads on the forum. Even though time has permitted me to image only a handful of books, I have discovered that the saving of space (b&w -vs- greyscale) is no longer an issue at all in modern conditions of cheap storage. Scanning in greyscale has at least 2 advantages: (i) generally, a "nicer" visual aspect than b&w, albeit at the expense of bigger file size, while still leaving the option of saving output from ScanTailor as b&w if a smaller pdf is required and the result is acceptable; and (ii) greyscale images of pages that have marks, stains, discolorations, areas of the same page that are too dark or too light (reflecting the paper original), dark shadow background etc can be corrected using layers in e.g. GIMP before ScanTailor processing (or after). This fact is sufficiently important for me to memorialize it here now even if it is stated in other threads.

Specifically, I have had quite a steep learning curve with GIMP, a process not helped by most if not all online tutorials being addressed to purposes that have nothing to do with cleaning a damaged page of text for imaging and/or OCR. There are however several tutorial books, and one from my local library at least helped get me started. So now I can fix basic issues like correcting localized text warp, making uniform the exposure of a page image that had under-exposed and over-exposed parts reflecting unequal inking of the original paper, and digitally editing the ink in words and letters where perhaps even the inked paper page had omitted them owing to cheap/poor printing methodology as well as later decay. Also resizing pages, fitting to canvas size etc (though these last two I find easier in Irfanview).

2) The advantages of outputting in DjVu format if one doesn't have access to a commercial program like the modern versions of Adobe Acrobat Pro or another program which can create pdf files of acceptable size from greyscale images. For details see thread
viewtopic.php?f=20&t=3339
Make a djvu file and add ocr: DjVuToy; TiffDjvuOcr; CuneiDjVu

As will be seen from that thread, using free tools (MS Windows) one can not only create rather small DjVu files using even greyscale tiff files with greater definition (e.g. 400dpi up if needed) but a searchable sub-image text layer can be included, again using free software as therein shown. And the problem of creating a DjVu file including one or more pages of mixed text and pictures (otherwise resulting sometimes in artifacts spoiling the picture part) can be worked round by saving such page(s) e.g. in DjVuSolo as photo (as opposed to e.g. scanned, perfect) and then substituting them (for any such pages that may have been DjVu encoded as 'scanned' or other default modes) using the Edit function of e.g. DjVuToy. (This workaround being an inferior means to a similar (in result) end as the djvu_imager and djvu_small application suite[*], which I found had more steps to learn before practical implementation).

[*] http://www.djvu-soft.narod.ru/scan/djvu_imager_en.htm

I don't know if licensing issues affect the permitted use of DjVu format by e.g. archive.org but their Luradoc compressed pdf files I find are a very inferior substitute; even a good multi-format reader like SumatraPDF stalls and halts in page turning while it labours to decompress whereas in my experience DjVu files scroll smoothly without such hesitation.

Last, I find DjVu metadata can now be viewed/edited using an MS Windows explorer extension:
https://www.cuminas.jp/en/downloads
DjVu Shell Extension Pack

I haven't yet tested whether Phil Harvey's updated exiftool can operate likewise on a DjVu file:
https://www.sno.phy.queensu.ca/~phil/exiftool/