
Re: Noob Questions on Scanning Process and E-Reader Formats

Posted: 24 Jun 2013, 03:22
by dtic
Some tools for turning images into pdf have an OCR step built in, for example Adobe Acrobat. With such a tool, OCR is included in your step 4. Otherwise you run OCR after the pdf/djvu is created and use some tool to insert the OCR'ed text into the file.

Re: Noob Questions on Scanning Process and E-Reader Formats

Posted: 24 Jun 2013, 19:58
by rkomar
I converted a lot of my old textbooks and references into PDF files a few years ago. At the time, mathematical equations, computer code snippets and tables were very hard to put into EPUBs, so I just left them as scanned images inside the PDF files. I also found that such a bare-bones document was very hard to use as a reference. I ended up adding all of the chapters and sections to the "bookmarks" section in each file. _That_ turned out to be as much work as everything else when dealing with books with detailed contents (computing the page offsets, typing in the text for each entry, and so on). Still, it was needed if I wanted to be able to find information easily in the documents. You can use OCR to add a text layer to the document and search that when looking for information, but I personally don't think that's as good as having a table of contents.
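
To give an idea of the page-offset part, here is a minimal Python sketch. It assumes the printed page numbers run continuously from one known reference page, which is not true for every book (front matter in roman numerals, unnumbered plates and so on need separate handling). The names `TocEntry`, `pdf_page` and `build_outline`, and the indented "title -> page" output, are made up for illustration and are not the input format of any particular bookmarking tool.

```python
from dataclasses import dataclass


@dataclass
class TocEntry:
    title: str
    printed_page: int  # page number as printed in the book
    level: int = 0     # 0 = chapter, 1 = section, ...


def pdf_page(printed_page: int, ref_printed_page: int, ref_pdf_page: int) -> int:
    """Map a printed page number to a PDF page number.

    ref_printed_page / ref_pdf_page describe one page whose position in
    the PDF is known, e.g. printed page 1 sits on PDF page 15.
    """
    offset = ref_pdf_page - ref_printed_page
    return printed_page + offset


def build_outline(entries, ref_printed_page, ref_pdf_page):
    """Return one indented 'title -> pdf page' line per TOC entry."""
    lines = []
    for entry in entries:
        page = pdf_page(entry.printed_page, ref_printed_page, ref_pdf_page)
        lines.append(f"{'  ' * entry.level}{entry.title} -> {page}")
    return lines


if __name__ == "__main__":
    toc = [
        TocEntry("1 Introduction", 1),
        TocEntry("1.1 Notation", 4, level=1),
        TocEntry("2 Linear Systems", 42),
    ]
    # Hypothetical book: printed page 1 is the 15th page of the PDF.
    for line in build_outline(toc, ref_printed_page=1, ref_pdf_page=15):
        print(line)
```

You still have to type in the titles, but the offsets only need one reference page per numbering run.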

Re: Noob Questions on Scanning Process and E-Reader Formats

Posted: 25 Jun 2013, 11:23
by dtic
@rkomar: Yeah, manual bookmarking takes a lot of time. I posted a script that, combined with jpdfbookmarks, speeds up bookmark creation a lot. See this thread http://diybookscanner.org/forum/viewtop ... =19&t=2837 , especially post number 4.

Re: Noob Questions on Scanning Process and E-Reader Formats

Posted: 25 Jun 2013, 21:14
by recaptcha
So if I want to have a lot of searchable reference books and articles on a tablet/e-reader, what would you recommend in terms of saving processing time? It's starting to sound like a major undertaking.

Re: Noob Questions on Scanning Process and E-Reader Formats

Posted: 26 Jun 2013, 11:34
by dtic
If you're on Windows and have access to Acrobat (it sounded like you did) then I suggest you keep it simple at first:
1. build a simple DIY cardboard scanner with a sheet of glass/plastic
2. run the book page photos through Scan Tailor
3. then turn the images into an OCR'ed pdf in Acrobat (try the ClearScan OCR setting)

Once you get the hang of it you can add more steps, test out different software here, build a scanner that is faster to operate and so on.
Do save a backup of all unedited book photos from the start. That way you can always reprocess at a later time when you have more experience with the different settings and can add additional postprocessing steps.
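
For the backup step, something like the following Python sketch would do. It assumes all photos of one book sit in a single folder; the function name `backup_photos`, the folder names and the list of extensions are only examples. It copies each photo once and never overwrites a file that is already in the backup folder.

```python
import shutil
from pathlib import Path

# Example extensions only; adjust to whatever your camera produces.
PHOTO_EXTENSIONS = (".jpg", ".jpeg", ".png", ".tif", ".tiff")


def backup_photos(source_dir, backup_dir, extensions=PHOTO_EXTENSIONS):
    """Copy unedited photos from source_dir to backup_dir.

    Files that already exist in backup_dir are skipped, so the backup
    is never overwritten by a later (possibly edited) version.
    Returns the list of newly copied files.
    """
    source = Path(source_dir)
    backup = Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)

    copied = []
    for photo in sorted(source.iterdir()):
        if not photo.is_file() or photo.suffix.lower() not in extensions:
            continue
        target = backup / photo.name
        if target.exists():
            continue  # keep the original backup untouched
        shutil.copy2(photo, target)  # copy2 also preserves timestamps
        copied.append(target)
    return copied


if __name__ == "__main__":
    # Hypothetical folder names.
    new_files = backup_photos("mybook/raw", "backup/mybook/raw")
    print(f"Backed up {len(new_files)} new photo(s)")
```

Run it right after copying the photos off the camera, before Scan Tailor or anything else touches them.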