
How to create searchable PDF out of scanned files?

salim_aliya
Posts: 1
Joined: 18 Oct 2015, 18:33
E-book readers owned: PocketBook Sense
Number of books owned: 114
Country: Germany

How to create searchable PDF out of scanned files?

Post by salim_aliya » 21 Oct 2015, 15:38

Hi,

I am really new to this and need some advice on how to continue.

I have scanned two books so far: one has 955 pages and the other has 1009 pages. I have processed all pages with Scan Tailor and now have one TIFF file per page, very clean and ready to assemble into a PDF.

Assembling all the files into a PDF is not the problem. The problem is how to make a searchable PDF out of them.

Some time ago I bought ABBYY FineReader 9.0, and as far as I can figure out, I probably need all the font files used in the book to create a perfect searchable PDF? I ask because when I create the PDF and select text on a page, I can see that an overlay was made with a completely different font.

How do you create a searchable PDF out of your scanned files?

By the way, I am working on Windows 7 x64.

ilmarmors
Posts: 15
Joined: 06 Jun 2015, 07:46
Number of books owned: 0
Country: Latvia
Location: Riga, Latvia

Re: How to create searchable PDF out of scanned files?

Post by ilmarmors » 21 Oct 2015, 17:58

I use ABBYY FineReader to create searchable PDFs, with the "Text under the page image" save mode. This option saves the entire page image as a picture and places the recognized text underneath. Use it to create a fully searchable document that looks virtually the same as the original.
Let me know, if you are interested to get F608ZZ, anti-reflective glass, QLV-1 MR16 GX5.3 socket holders or other Archivist components in Europe.

BruceG
Posts: 67
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: How to create searchable PDF out of scanned files?

Post by BruceG » 22 Oct 2015, 04:03

Hi
I use OmniPage.
Most books use only one set of fonts, and some books tell you what it is. I expect that in ABBYY you can just select that one font, as long as all the characters you want are in the set you have. If there is more than one font, you should be able to select more than one in ABBYY as well.
I save as text (photos are included) with no image underneath.
Acrobat can also do something similar. I use Acrobat 9. The text layer is not a font as normally understood, so if you copy and paste, the text does not come out as expected. Editing is difficult in v9; later versions are better, I understand.
If you have more than one book you want to search, Acrobat will create an index of as many PDF files as you want. It could index every PDF file on your computer.
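
As a rough illustration of what searching many files through one index involves, here is a minimal Python sketch of an inverted index. It is not how Acrobat's index works internally; it assumes the text has already been extracted from each PDF into a plain string, and the file names and sample text are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it.

    `documents` is a dict of {name: extracted text}. Extracting the text
    from the PDFs is assumed to have been done elsewhere.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents that contain every query word."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Hypothetical file names and text, for illustration only.
docs = {
    "1915-03.pdf": "Annual meeting minutes",
    "1921-07.pdf": "Minutes of the special meeting",
    "1930-01.pdf": "Editorial notes",
}
index = build_index(docs)
print(search(index, "meeting minutes"))  # ['1915-03.pdf', '1921-07.pdf']
```

The index is built once and then reused, so each search only looks up a few words instead of re-reading every file.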

The quality of the PDF depends on the quality of the original material. The age of the book, cheap paper, type wear, etc. all play a part.

recaptcha
Posts: 58
Joined: 03 Sep 2010, 13:23
Number of books owned: 0
Location: Calgary, Alberta, Canada

Re: How to create searchable PDF out of scanned files?

Post by recaptcha » 19 Nov 2015, 01:34

May I ask a silly question on this topic?

Once you've taken the images with the camera/cradle method, put them through Scan Tailor and/or Book Scan Wizard, and done the OCR and all the post-processing, what is the final image supposed to look like?

Does it look like a photo of the book page? Or does it look like fresh text against a pure white background as if you had just typed it on a word processor?

I guess I don't quite understand what the post-processing and OCR do. Is the text stripped away from the original image and pasted against a blank background?

Could someone post a sample of their work?

BruceG
Posts: 67
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: How to create searchable PDF out of scanned files?

Post by BruceG » 19 Nov 2015, 06:32

Both "looks like a photo of the page" and "a pure white background" are possible. I use the pure white background method because I started out that way and the file size is smaller. File size was important because the project I was working on was 100 years of a magazine, and all of it needed to be searchable at the same time.
Recently I was given some scans of printed minutes, 300-400 pages each. They had been OCRed (photo type), with file sizes of 100-200 MB. Putting them through OmniPage (the OCR program I use) reduced them to 1.5-3.4 MB, less than 2% of the original scan size.
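
The size reduction can be checked with a couple of lines of Python. The post gives only ranges, so pairing the smallest output with the smallest input and the largest with the largest is an assumption made here for illustration.

```python
def percent_of_original(reduced_mb, original_mb):
    """Return the reduced file size as a percentage of the original size."""
    return 100.0 * reduced_mb / original_mb


# Assumed pairing: the post does not say which output belongs to which input.
for reduced, original in [(1.5, 100), (3.4, 200)]:
    pct = percent_of_original(reduced, original)
    print(f"{reduced} MB of {original} MB is {pct:.1f}%")
```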

What put me off the text over photo method was the uncertain quality of the OCR.
[Attachment: Sample scan Internet Archive Adobe.pdf, 47.97 KiB]
The text looks OK, but copy it into, e.g., Word and then compare. This book is over 100 years old; new material is much, much better.

The following is the white background method. Editing was required.
[Attachment: Sample OCR Internet Archive.pdf, 38.9 KiB]
I would guess there is a use for both. With new material you may not see the difference, whereas with old material you will. If the material is only going to be read, it may make no difference either. Is file size important? If you want to search hundreds or thousands of files, or send them by email, it may be.
Do you want to turn the material into an ebook? And so on.

recaptcha
Posts: 58
Joined: 03 Sep 2010, 13:23
Number of books owned: 0
Location: Calgary, Alberta, Canada

Re: How to create searchable PDF out of scanned files?

Post by recaptcha » 19 Nov 2015, 18:08

Thanks Bruce.

How was the second sample (white background) achieved? Does this have to be done manually for each page?

I haven't tried scanning anything yet, and am just deciding on whether to go with the camera/cradle method or flatbed scanner (with bookedge).
I don't want to have to do any more post-processing than necessary, and am willing to spend more time on the front end (scanning). It seems the camera/cradle method would involve slightly more post-processing, but I could be wrong.

I think I want my final files to be in .pdf because I'll be scanning a lot of academic and technical books, some with mixed text and charts/photos. So I'll want to keep the formatting of the page. And I definitely want searchable text and the most accurate OCR possible.

BruceG
Posts: 67
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: How to create searchable PDF out of scanned files?

Post by BruceG » 20 Nov 2015, 02:26

yasw was used to crop the pages etc. Then came OCR (I use OmniPage), then Infix (a PDF editor) for the text. Infix can also remove the hue, one page at a time, but the hue in or over pictures could not be removed this way. There is a how-to in Tutorials for doing this with Photoshop. I only have Photoshop Elements, so I have to do one photo at a time instead of batch handling. Hue is not normally a problem if the lighting and camera settings are OK.

The quality of the scan or photo determines how much OCR editing is needed.
There are free OCR programs, I think. Other OCR companies may have trial versions that you could practice with. They handle text, charts, and pictures well. Paper and print are the determining factors. Technical books may have some small font sizes and special characters that are more difficult.
Fonts can also be an issue. What is available to a printer is not always available free online.
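
One free OCR program is Tesseract, which is not the tool used in this thread (that was OmniPage and ABBYY FineReader). As a minimal sketch, assuming Tesseract is installed and on the PATH, the Python below builds and runs the command `tesseract <image> <outputbase> -l <lang> pdf` for each TIFF page. The `pdf` output is the page image with an invisible text layer, similar in spirit to the "Text under the page image" mode. The helper names `build_ocr_command` and `ocr_folder` are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(tiff_path, lang="eng"):
    """Return the Tesseract command that turns one page image into a searchable PDF."""
    tiff_path = Path(tiff_path)
    # Tesseract appends ".pdf" to the output base by itself.
    output_base = tiff_path.with_suffix("")
    return ["tesseract", str(tiff_path), str(output_base), "-l", lang, "pdf"]


def ocr_folder(folder, lang="eng"):
    """Run OCR on every .tif/.tiff file in `folder`, producing one PDF per page."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    pages = sorted(
        p for p in Path(folder).iterdir() if p.suffix.lower() in {".tif", ".tiff"}
    )
    for page in pages:
        subprocess.run(build_ocr_command(page, lang), check=True)
    return [p.with_suffix(".pdf") for p in pages]
```

This produces one PDF per page; merging them into a single book file is a separate step that is not shown here. Because the text layer is invisible, the fonts used in the printed book are not needed.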

How much time you need to produce what you require can only be found out by trial and error.
Scanning a 600+ page book, OCR, and editing took me 2 days. Is it worth it? Only you can decide. I scanned the book in grayscale and then the plates, covers, etc. in colour, replacing each grayscale page with its colour version. It all takes time.

Happy to do a page or so, for you to see what can/cannot be done.
