
Add hidden text layer from different pdf files

jose2046

Add hidden text layer from different pdf files

Post by jose2046 » 22 Aug 2011, 12:22

Hi All,

I am starting to fix up some books I photographed some time ago, and along the way I am getting familiar with some of the technical processes.

My question is:

I have a tailored PDF file with the page images of a book, nearly 160 MB in size (the book has almost 350 pages, about 50 of them with grayscale images).

I used OCR software to make a searchable PDF document from that file, but the result is too big for storage.

I then made a text-only OCR PDF, and I changed the format, size and resolution of all the master images to reduce the file size. The resulting images are really good for on-screen reading and the size dropped considerably, but they are not good enough to get good OCR text from. So I have two PDF documents: one with the OCR text result and another with the "small" image files.

I would like to embed the text PDF as a hidden layer (or similar) in the other PDF that has the small images, but I couldn't find how to do it. Any help will be appreciated!

Thanks for reading!

My best,

J

Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Re: Add hidden text layer from different pdf files

Post by Misty » 22 Aug 2011, 13:08

This is tricky to do with free software, but it can be done. PDFBeads can make tiny PDFs with OCR text layers.

One of the problems with making PDFs is that compressed images in greyscale or colour are huge - very huge. It's very difficult to get them down to an acceptable size. That's why many people use Scan Tailor. It reduces pages to pure black and white, which compresses very well. With the right tools, you can make books of hundreds of pages into a PDF under 10MB, with OCR.
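To put rough numbers on the size difference, here is a small sketch that computes the raw (uncompressed) size of one page at different bit depths. The page dimensions and resolution are assumed values chosen only for illustration, and real PDF sizes depend on the compression used; the point is that a 1-bit page starts out 8 to 24 times smaller even before a bitonal codec such as CCITT Group 4 or JBIG2 is applied.

```python
# Raw (uncompressed) size of one page image at different bit depths.
# The page size and resolution below are assumed values for illustration only.

def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


PAGE = (6, 9, 300)  # assumed: a 6 x 9 inch page captured at 300 dpi

bitonal = raw_page_bytes(*PAGE, bits_per_pixel=1)   # pure black and white
grey = raw_page_bytes(*PAGE, bits_per_pixel=8)      # greyscale
colour = raw_page_bytes(*PAGE, bits_per_pixel=24)   # RGB colour

for name, size in [
    ("1-bit black and white", bitonal),
    ("8-bit greyscale", grey),
    ("24-bit colour", colour),
]:
    print(f"{name}: {size / 1_000_000:.2f} MB per page before compression")
```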

The other problem is that not all OCR tools give you an output that's suitable to be used by other software this way, so PDFBeads requires that you use the Tesseract or Cuneiform open-source OCR programs.
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.

jose2046

Re: Add hidden text layer from different pdf files

Post by jose2046 » 23 Aug 2011, 12:36

Thanks Misty for the quick reply.

I was working with Scan Tailor, but in my case the B/W images are very poor, and as some of the books are art studies or similar, the images are really important...

I understand what you say, but I was wondering if there is any way to embed one PDF file into another one (page by page, with the same number of pages, same size, etc.) as a hidden layer or a transparent one. I have Acrobat and Nitro, but I haven't been able to find out how to do it.

Thanks again,

Kind Regs.
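For readers with the same question, the page-by-page overlay described above can be sketched in Python with the third-party pypdf library. This is only a sketch under assumptions: both PDFs have the same page count and page size, the image pages are opaque and cover the whole page, and the file names in the usage note are hypothetical. It draws each image page on top of the matching text page, so the OCR text ends up underneath the picture, where viewers can usually still search and select it.

```python
def pair_pages(text_pages, image_pages):
    """Pair pages one-to-one, refusing to continue if the counts differ."""
    text_pages = list(text_pages)
    image_pages = list(image_pages)
    if len(text_pages) != len(image_pages):
        raise ValueError(
            f"page count mismatch: {len(text_pages)} text pages, "
            f"{len(image_pages)} image pages"
        )
    return list(zip(text_pages, image_pages))


def overlay_images_on_text(text_pdf, image_pdf, out_pdf):
    """Write out_pdf with each image page drawn over the matching text page."""
    # Imported here so the pairing helper above works without pypdf installed.
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    text_reader = PdfReader(text_pdf)
    image_reader = PdfReader(image_pdf)
    writer = PdfWriter()

    for text_page, image_page in pair_pages(text_reader.pages, image_reader.pages):
        # merge_page appends the other page's content, so the image is painted
        # after (on top of) the OCR text already on the page.
        text_page.merge_page(image_page)
        writer.add_page(text_page)

    with open(out_pdf, "wb") as handle:
        writer.write(handle)
```

A call would look like `overlay_images_on_text("ocr_text.pdf", "small_images.pdf", "searchable.pdf")`. If the two files differ in page size, the pages would need scaling first, which this sketch does not attempt.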

jose2046

Re: Add hidden text layer from different pdf files

Post by jose2046 » 08 Sep 2011, 19:18

It's me again.

Well, I decided to do something very manual, but it works.

The situation is:
I have:
- an OCR text-only PDF document
- a folder with the text-only B&W bitmap images from Scan Tailor at full resolution
- a folder with the scans of the pages that contain pictures, output from Scan Tailor in Mixed mode but converted to smaller, medium-quality JPEG files (one image from Scan Tailor is nearly 4 MB; as a JPEG it is nearly 200 KB)

The objective is to make a searchable PDF.

I worked in Nitro PDF.

What I did (very manual, I know) was open the OCR PDF file and place each image over its page (they fit because of the size and resolution of each image from Scan Tailor).

When I finished I saved the PDF, and it was nearly 70 MB for a 200-page book... quite big for so much work. BUT I saved it again as another PDF file with a different name and it was 11 MB, without changing anything in it!

I do not know why Nitro works this way, but the result is worth it.

I am writing this just to share. But if somebody knows of a program that can batch-import the (numbered) images from a folder into a PDF, following their order, it would be great to know about it!!!

My best!

JC
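One possible way to batch the image import asked about above is a short Python script. The sketch below is an illustration, not a tested replacement for the Nitro workflow: it sorts the numbered files in natural order (so page2 comes before page10) and writes them into one PDF with the third-party Pillow library. The function names, the extension list, and the folder and file names in the usage note are all assumptions.

```python
import re
from pathlib import Path


def natural_key(name):
    """Split a file name into text and number chunks so page2 sorts before page10."""
    return [
        int(chunk) if chunk.isdigit() else chunk.lower()
        for chunk in re.split(r"(\d+)", name)
    ]


def ordered_images(names, extensions=(".jpg", ".jpeg", ".png", ".tif", ".tiff")):
    """Keep only image file names and return them in natural numeric order."""
    keep = [name for name in names if name.lower().endswith(extensions)]
    return sorted(keep, key=natural_key)


def images_to_pdf(folder, out_pdf):
    """Write every image in folder into one PDF, one image per page."""
    # Imported here so the sorting helpers above work without Pillow installed.
    from PIL import Image  # third-party: pip install Pillow

    folder = Path(folder)
    names = ordered_images(path.name for path in folder.iterdir())
    if not names:
        raise ValueError(f"no images found in {folder}")

    pages = []
    for name in names:
        img = Image.open(folder / name)
        # Keep bitonal, greyscale and RGB images as they are; convert the rest.
        pages.append(img if img.mode in ("1", "L", "RGB") else img.convert("RGB"))

    pages[0].save(out_pdf, "PDF", save_all=True, append_images=pages[1:])
```

A call would look like `images_to_pdf("small_pages", "book_images.pdf")`. Pillow re-encodes the images while writing the PDF, so quality and file size may differ from the source files.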

the.traveller
Posts: 73
Joined: 22 Sep 2010, 03:58
E-book readers owned: Samsung Tab S
Number of books owned: 500
Country: Netherlands
Location: Rotterdam, Netherlands

Re: Add hidden text layer from different pdf files

Post by the.traveller » 13 Sep 2011, 17:51

I wonder, if you have the original pictures, what would happen if:

A: You use Comical with a zip or rar file made from the original photos. Would zipping them make it much smaller?
It won't be searchable.

B: You use the commercial OCR software ABBYY FineReader to scan your magazines and get a searchable magazine with full-colour pictures. The text will be clickable, so you can jump from the index to the desired article/page.

C: You reduce the bit depth of your pictures to 8 bits in Photoshop? In Photoshop, and probably in other photo software too, you can save pictures at web quality, which makes them smaller but still nice to look at.

With Acrobat Pro you can have PDFs for separate chapters of your book and one PDF with an index to all the separate PDFs. Might that work for you?
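For anyone without Photoshop, the "smaller but still nice to look at" idea from option C can be approximated with a script. The following is a rough sketch using the third-party Pillow library; it treats "8 bits" as 8-bit greyscale, which is only one reading of the suggestion, and the default pixel limit and JPEG quality are assumed values to tune by eye.

```python
def scaled_size(width, height, max_side):
    """Return a size whose longest side is at most max_side, keeping the aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, leave unchanged
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def shrink_for_screen(src, dst, max_side=1600, quality=60, greyscale=False):
    """Save a smaller JPEG copy of src at dst for on-screen reading."""
    # Imported here so scaled_size above works without Pillow installed.
    from PIL import Image  # third-party: pip install Pillow

    with Image.open(src) as img:
        # "L" is 8-bit greyscale; "RGB" keeps colour at 8 bits per channel.
        img = img.convert("L" if greyscale else "RGB")
        new_size = scaled_size(img.width, img.height, max_side)
        img = img.resize(new_size, Image.LANCZOS)
        img.save(dst, "JPEG", quality=quality, optimize=True)
```

A call would look like `shrink_for_screen("page_017.tif", "page_017.jpg", greyscale=True)`, where both file names are hypothetical.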
