DRAFT MUG's GUIDE: Converting old books to PDF
Posted: 09 Jan 2013, 15:21
For someone wanting to do volume digitisations of old books... This is a draft only so far; it occurred to me to ask for comments here to weed out any obvious errors first, and in case anyone has a proven solution to the problem in #6, which I'm tentatively thinking may be fixed using the alternative image-to-PDF application noted below:
-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0
SUBJECT: How to make sensibly sized PDFs (with searchable text layer) of old books using free software.
The objective of this note is to provide an outline guide to producing reliable digitisations of old books and other documents without using a proprietary optical character recognition (OCR) software package. All applications are for MS Windows, but I think all or some of the apps named have Mac and *ix versions or analogues.
PDF documents come in various flavours. I haven't examined the formal specs, but I've worked with or created PDFs which comprise:
-- images only of the paper pages of a book or document;
-- images of the paper pages with a searchable/copiable text layer hidden underneath;
-- images of the paper pages with a searchable/copiable text layer on top, either only where the OCR program found the characters indistinct, or uniformly for all text irrespective of recognisability;
-- text (and graphics among the text, if any) only, without images of the paper pages; makes for much smaller pdf files, but requires a lot of careful work to get the right result.
There are lots of good (mainly PDF) scans of books, documents etc. out there, but often the result could have been better. For instance, some PDFs are so large they cannot be read on portable ereaders well or at all. Or they may be just images of pages without searchable text; or text-only PDFs which have not been diligently proofed, so they omit material, do not reflect the original formatting, and so on.
Conversely, making a good scanned document that contains only text and pictures (if any), rather than images of the pages themselves, can be tricky and has a fairly steep learning curve. The results are rarely perfect either: words get missed out or munged. You need to spend time at it, particularly on proofing, and some would say proofing can be done properly only by one person reading the paper original aloud while a second person follows the OCR copy (or by using TTS readers or some other workaround).
So instead of all that, here is an alternative that works for me:
1. Scan books and similar documents that are mainly text, and not images (or not colour images), at 300 dpi. I have found that anything higher generally confuses OCR programs and results in stupidly big files. Don't scan text in colour, and don't usually scan text in grayscale either; I've never found it increases OCR efficiency or accuracy.
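To see why 300 dpi black-and-white keeps files sane, here is a rough back-of-envelope calculation I've added (the 6x9 inch page is just an assumed example size; real TIFF/PDF files are compressed, but the ratios hold):

```python
# Rough uncompressed size of one scanned page at various settings.
# Illustrative arithmetic only, but it shows why colour and high dpi
# balloon file sizes compared with 1-bit black & white at 300 dpi.

def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    pixels = (width_in * dpi) * (height_in * dpi)
    return pixels * bits_per_pixel // 8

page = (6, 9)  # a typical book page in inches (assumed example size)
bw_300  = raw_page_bytes(*page, 300, 1)    # 1-bit black & white
rgb_600 = raw_page_bytes(*page, 600, 24)   # 24-bit colour at 600 dpi

print(bw_300)             # 607500 bytes, roughly 0.6 MB per page
print(rgb_600 // bw_300)  # 96: colour at 600 dpi is 96x larger
```

Compression narrows the gap in practice, but a book of several hundred colour pages quickly becomes unusable on a portable reader either way.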
2. Make sure your scanning application allows you to save the scan of each page of the book or other item as .tiff or (maybe) .jpeg. In other words, an application that discards the images of the pages and does auto-OCR, leaving you with only editable text, is a waste of time for these purposes.
3. Clean up the image files using ScanTailor from Sourceforge.net; see also the pages about it at http://www.diybookscanner.org/forum/index.php . By clean up I mean making the images better suited to accurate OCR and a smaller file size, for example by:
- rendering the pages at a uniform size;
- correcting horizontal / vertical misalignment;
- removing noise like black margins, handwritten notes etc;
- splitting pages that have been scanned from an open book, so that two pages sit side by side, into two halves;
- despeckling;
- dewarping text;
- reducing the size of pages that mix text and pictorial content by applying the application's mixed mode, so that only the picture areas are rendered in gray or colour as applicable, the text being rendered in b&w only;
-- and so on. Make sure you change the output from the default 600 dpi to 300 dpi, and unless the text is very spotted, turn off despeckling, since it sometimes turns commas into periods.
Basic tutorial http://vimeo.com/12524529
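For the curious, the core of two of those cleanup passes can be sketched in a few lines. This is my own toy illustration, not ScanTailor's actual code: binarising a grayscale image to pure black and white, then despeckling by dropping isolated black pixels (which is also exactly how an over-eager despeckler can eat small marks like commas):

```python
# Toy versions of two cleanup passes on a tiny grayscale "image"
# (a list of rows of 0-255 pixel values). Illustrative only.

def binarize(img, threshold=128):
    # pixels darker than the threshold become black (1), the rest white (0)
    return [[1 if px < threshold else 0 for px in row] for row in img]

def despeckle(bw):
    # drop any black pixel with no black 4-neighbours (an isolated speck)
    h, w = len(bw), len(bw[0])
    out = [row[:] for row in bw]
    for y in range(h):
        for x in range(w):
            if bw[y][x]:
                nbrs = sum(bw[ny][nx]
                           for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1))
                           if 0 <= ny < h and 0 <= nx < w)
                if nbrs == 0:
                    out[y][x] = 0
    return out

gray = [[250, 250, 250],
        [250,  40, 250],   # one dark speck in the middle
        [250, 250, 250]]
bw = binarize(gray)        # speck becomes a lone black pixel
clean = despeckle(bw)      # ...and despeckling removes it
```

A comma in small type is only a pixel or two bigger than a genuine speck, hence the advice above to turn despeckling off unless the scan really needs it.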
4. Assemble the TIFF files output by ScanTailor into a PDF file using a free utility like the ImPDF library:
http://www.comsquare.ch/files/downloads/ImPDF_0_90.zip
This contains a .dll which is packaged as one of the plugins for the graphics program IrfanView http://www.irfanview.com/ . Note that the version packaged with the plugins at http://www.irfanview.com/plugins.htm is out of date, so preferably get it from the comsquare website and install it to IrfanView's plugins folder.
In Irfanview, menu: Options / Multipage Images / Create Multipage PDF (Plugin)
The plugin seems to have a minor bug inasmuch as it won't let you select more than ~200 image files at once. So just select fewer than 200 and repeat the add-images step as necessary.
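If you're batch-minded, the workaround for that ~200-file limit is easy to script. A minimal sketch (my own helper, nothing to do with the plugin itself):

```python
# Split a sorted list of page-image filenames into batches small
# enough for the plugin to accept, so you can add them ~199 at a time.

def batches(files, limit=199):
    for i in range(0, len(files), limit):
        yield files[i:i + limit]

pages = ["page_%04d.tif" % n for n in range(1, 451)]  # 450 scans
groups = list(batches(sorted(pages)))
# yields three batches: 199 + 199 + 52 files
```

Zero-padded filenames matter here: without them, an alphabetical sort would interleave page_10 before page_2 and scramble the page order in the PDF.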
[NOTE: problem / downside remarked on in #6 below concerning legibility on portable ereaders of pdfs made with this application; at time of posting, testing still under way, but a better alternative may be to use the free Convert Images to PDF tool http://www.pdfill.com/pdf_tools_free.html ]
5. What you will then have is a PDF comprising images of the pages of the book/document. But it won't be searchable or copyable as text, since there is no text layer. Generally you need an OCR program for that, and specifically one that can output PDF with text alone or text under images. But if you install the free PDF-XChange Viewer from http://www.tracker-software.com/ you will find it has a rudimentary OCR function that will OCR an existing image-only PDF and add a text layer under the page image. I've experimented with its high accuracy mode and the results have been good, provided the input image files are tidied up well (step #3 above). They won't be perfect, since you cannot review or change the text layer, but it's better than nothing. And you still have the image of the page itself, so if you need to copy a passage whose text layer contains an error, you can correct it manually in the Windows clipboard or in whatever application you paste the text into.
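If you're wondering how text can sit "under" a page image at all: PDF has a text rendering mode 3 that draws text invisibly while leaving it searchable and copyable, and that is the standard trick OCR tools use. Below is a minimal hand-rolled sketch of the mechanism I've added for illustration; a real OCR tool would also draw the page image and position each recognised word individually:

```python
# Build a tiny one-page PDF by hand containing one invisible but
# searchable line of text, using text rendering mode 3 ("3 Tr").
# Pure illustration of the hidden-text-layer mechanism, not a
# production PDF writer. `text` must avoid ( ) and \ characters.

def make_pdf_with_hidden_text(text: str) -> bytes:
    content = ("BT /F1 12 Tf 3 Tr 72 720 Td (%s) Tj ET" % text).encode()
    objs = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for i, body in enumerate(objs, start=1):
        offsets.append(len(out))          # byte offset for the xref table
        out += b"%d 0 obj\n" % i + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objs) + 1)
    out += b"0000000000 65535 f \n"       # mandatory free-entry for object 0
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # 20-byte xref entries
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objs) + 1, xref_pos))
    return bytes(out)

pdf = make_pdf_with_hidden_text("searchable but invisible")
```

Open the result in a desktop viewer and the page looks blank, yet searching for "invisible" finds the word; OCR'd books do exactly this, with the page image painted over the top.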
6. The only downside I've found so far is that while the resulting PDF files read fine on a Windows platform using, for instance, Acrobat or PDF-XChange Viewer, they come up as blank pages on certain portable reading devices. This seems to be a common problem with PDFs from many sources and may be due to the device's PDF reader being built to older standards. For instance, I've read posts saying you can fix the error by producing the PDF without JPEG2000 compression. [Please post a reply if you have tested any free application that can fix this. I've tried writing the PDF via a convert-to-PDF print driver like CutePDF or PDFCreator, but they strip out the text layer (they only print the layer that is visible on screen, i.e. the image of the paper page), and even then the output still came up blank on the portable reader. Running cat and other operations in pdftk also didn't seem to have any effect.
Potential fix: still testing, but if in step #4 above you combine the page images into a PDF file using the "Convert Images to PDF" tool from http://www.pdfill.com/pdf_tools_free.html , this may be a fix. I have tested one file made this way, which I then OCR'd as per step #5, and if I recall correctly it worked, although several files were produced for testing and I may have lost track! So if you know of any working solutions, please post.]
There are free OCR programs out there, like those below, but as far as I know none of them will make a PDF of page images with text underneath. Most produce just OCR'd text, and that with results of variable accuracy.
1) Cuneiform EN V12 (Windows)
http://cognitiveforms.ru/downloads/setu ... orm_en.exe
Generally good recognition of page layout, paragraphs etc. Clunky, though: for example, the install routine didn't produce properly named Start menu shortcuts, and it won't seem to allow adding multiple image files in a manual operation, only in batch mode. It will not output as PDF, so no text-under-image PDF, for example.
Nevertheless, in my view it is probably the best, and the one free OCR program that development effort should be concentrated on, rather than going with independent projects producing less than mediocre results.
2) SimpleOCR (Windows)
http://www.simpleocr.com/Download.asp
InstSocr.exe
3) FreeOCR v.4.2 (Windows)
http://www.paperfile.net/
http://www.paperfile.net/freeocr.exe
Uses the Tesseract (v3.01) OCR engine. Includes a Windows installer and GUI.
4) Tesseract
http://code.google.com/p/tesseract-ocr/
http://sourceforge.net/projects/vietocr/ (an example of a front-end). Very poor at recognising paragraph layout, in my experience.
5) KADMOS plugin for IrfanView
http://www.heise.de/download/kadmos-icrocr-sdk.html -- an OCR plugin for IrfanView; outputs plain text, and not greatly accurate.
6) OCRopus(tm) open source document analysis and OCR system
http://code.google.com/p/ocropus/
In my experience you will waste a lot of time using these for anything but the simplest of tasks.
-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0
EDIT: lol! It's always the way: having read here a few times over the months and then cobbled the above together, I now find HOMER
http://bookscanner.pbworks.com/w/page/4 ... /FrontPage