Converting PDF to ePub Calibre not working

Posted: 25 Jun 2018, 22:24
by BryanMatt
Hello, I have a 227-page book that I scanned on a flatbed scanner. I broke up the book into pages, scanned it at 600 dpi greyscale, and used ScanTailor to convert those pages to .tiff files. Then I converted those files into a pdf with Adobe Acrobat Pro 11 and ran OCR. The book is now searchable in Adobe or other readers that I've tried. I wanted to convert it to an ePub using Calibre, but every time I run it through Calibre it converts each pdf page to an image, so all the OCR is lost. Not sure if anyone is familiar with Calibre settings; I'm probably doing it wrong, but I have converted other pdf books to epub before.

Do I need to upload some sample images? Not sure what to upload here so that people know what to advise.

Thank you in advance.

Re: Converting PDF to ePub Calibre not working

Posted: 26 Jun 2018, 03:22
by zbgns
You may use Acrobat Pro to export the pdf to an MS Word file (Save -> Export to -> Microsoft Word). Edit the output .docx (correcting OCR errors may be necessary; it is also useful to use styles to mark headings, chapters and so on). Afterwards it can easily be exported to epub format.

BTW, Calibre is not the best available epub converter (although it is the best epub editor), so I personally prefer to export the .docx (or .odt) using other tools, e.g. there are LibreOffice extensions for that. The epub file may then be imported into Calibre, given a final edit and sent to the e-book reader.

Re: Converting PDF to ePub Calibre not working

Posted: 26 Jun 2018, 03:37
by L.Willms
And don't forget to convert most line breaks into spaces, so that you have free flowing text for the Epub.

It is better to use another OCR which does not keep the line breaks as Acrobat does, if the ultimate goal is an EPUB.

PDF is meant to present the look of each page as the PDF creator intended; EPUB, on the other hand, is free-flowing text which adapts itself automatically to each text size on each screen size of an e-book reader. Don't go via PDF if that is what you want.
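
For anyone who ends up with OCR text that still carries the page's line breaks, here is a minimal Python sketch of the "line breaks into spaces" step. It assumes that a blank line marks a paragraph boundary and that every other line break is just a wrap left over from the printed page; the function name `unwrap_paragraphs` is only an illustrative name, not part of any of the tools mentioned in this thread.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines into free-flowing paragraphs.

    Assumption: blank lines separate paragraphs; all other line
    breaks are wraps from the printed page.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    result = []
    for para in paragraphs:
        lines = [line.strip() for line in para.splitlines() if line.strip()]
        joined = ""
        for line in lines:
            if joined.endswith("-") and line[:1].islower():
                # Probably a word hyphenated across the line end: rejoin it.
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        result.append(joined)
    return "\n\n".join(result)
```

The hyphen rule is a guess: a genuine compound such as "well-known" that happens to break at the hyphen would be joined into one word, so the result still needs proofreading.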

Re: Converting PDF to ePub Calibre not working

Posted: 26 Jun 2018, 03:44
by L.Willms
zbgns wrote: 26 Jun 2018, 03:22 BTW, Calibre is not the best available epub converter (although it is the best epub editor), so I personally prefer to export the .docx (or .odt) using other tools, e.g. there are LibreOffice extensions for that. The epub file may then be imported into Calibre, given a final edit and sent to the e-book reader.
Calibre is mainly a manager for your eBook library, which has as a side job also the ability to modify eBooks.

I recommend the Writer2epub extension for OpenOffice and its LibreOffice offspring. Check out the en.wikipedia.org article on Writer2epub.

Re: Converting PDF to ePub Calibre not working

Posted: 26 Jun 2018, 07:01
by zbgns
L.Willms wrote: 26 Jun 2018, 03:37 And don't forget to convert most line breaks into spaces, so that you have free flowing text for the Epub.

It is better to use another OCR which does not keep the line breaks as Acrobat does, if the ultimate goal is an EPUB.
Newer versions of Acrobat Pro join lines into paragraphs and in general try to preserve the layout of the original document (as I remember, Acrobat 11 Pro does). So usually no additional conversion of line breaks into spaces is necessary (apart from fixing typical OCR errors).

On the other hand, I agree that there are better OCR programs than Acrobat. Personally I prefer tesseract (current 4.00 beta version) plus the gImageReader frontend. Admittedly, all text formatting is lost (it saves plain text), but it usually takes less time to recreate the formatting from scratch than to correct the formatting produced by Acrobat. gImageReader lets you select a rectangular area of the image to be OCR-ed (the selection may be applied to multiple pages at once), which is very useful for getting rid of headers and footers (e.g. page numbers, which are useless in reflowable text). I guess that ABBYY FineReader or OmniPage may be more convenient tools, but I have only a little experience with them.
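
If the headers and page numbers have already ended up in the plain-text output, something along these lines might remove most of them afterwards. This is only a rough Python sketch, not something gImageReader or tesseract provides: it assumes you have one text string per scanned page, that page numbers sit on a line of their own, and that a running header is a first line repeated on several pages. The name `strip_page_furniture` and the `min_repeats` threshold are made up for illustration.

```python
import re
from collections import Counter

# A line holding only a number, optionally preceded by the word "page".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)


def strip_page_furniture(pages: list[str], min_repeats: int = 3) -> list[str]:
    """Drop page-number lines and repeated first lines (running headers)."""
    # Count how often each non-empty first line occurs across pages.
    first_lines = Counter()
    for page in pages:
        lines = [line.strip() for line in page.splitlines() if line.strip()]
        if lines:
            first_lines[lines[0]] += 1
    headers = {line for line, n in first_lines.items() if n >= min_repeats}

    cleaned = []
    for page in pages:
        kept = []
        for line in page.splitlines():
            stripped = line.strip()
            if PAGE_NUMBER.match(stripped) or stripped in headers:
                continue
            kept.append(line)
        cleaned.append("\n".join(kept).strip())
    return cleaned
```

Note that a body line consisting only of a number (a year on its own line, say) would be dropped too, so selecting the text area before OCR, as described above, is the more reliable route.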

Re: Converting PDF to ePub Calibre not working

Posted: 26 Jun 2018, 08:10
by L.Willms
zbgns wrote: 26 Jun 2018, 07:01 I guess that ABBYY FineReader or OmniPage may be more convenient tools, but I have only a little experience with them.
Thanks for the clarification on Acrobat joining lines to paragraphs -- but Acrobat would have to do guesswork.

I use ABBYY FineReader. I got version 6 "Sprint" as an add-on to an Epson flatbed scanner, later upgraded to version 11, and still later bought version 12 with a "Black Friday" discount. I hope that this year's "Black Friday" will bring another offer which I can't refuse... Besides that, I am toying with the idea of buying a quota of pages for recognition of Fraktur (Gothic) script, once I have a project for that...

FineReader keeps character formatting like italics or bold with the output form "Free form text", and also has the option to keep line breaks. Besides this output form, it can produce PDF or DjVu files with a layout identical to the original, with or without the image of the original document. It can also output EPUB/FB2. I have never tried the DjVu or EPUB output, and I don't know what FB2 is.

HELP this is much more difficult than I thought it should be

Posted: 28 Jun 2018, 00:37
by BryanMatt
Well, I thank you for all of your advice. It kind of appears to be a helluva lot of work to create an epub because the text needs to be basically rebuilt, or rewritten into MS Word or some other word processor and then converted to an epub.

I'm not a Linux user. I had Ubuntu Linux years ago, but I really didn't like constantly hacking my computer to make it work. I just want it to work, so I use Windows with a few tweaks. I don't want to spend all my time learning commands; I want to simply push a button and convert the pdf to epub.

Calibre has done this for me with other books, but for some reason I can't get it to work with this particular book. The problem is that all the OCR software I've tried other than Adobe just converts the scanned images to pictures inserted into pdf pages, epub pages, or whatever format. I can't seem to find a way to extract the text to any format.

So I'm left with opening MS Word and cutting and pasting the text page by page into Word, formatting it, fixing spelling errors and typos, etc. When I'm done I guess I will use Calibre or Sigil or something else to convert that document into an epub.
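
For that last conversion step, Calibre installs a command-line tool called `ebook-convert`, which takes an input file and an output file. Once the Word files are cleaned up, a short Python script could run it over a whole folder, which is about as close to "push a button" as this gets. The sketch below is untested against any particular Calibre version; the function names are hypothetical, and it assumes `ebook-convert` is on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(source: Path, out_dir: Path) -> list[str]:
    """Return the ebook-convert call that turns one .docx into an .epub."""
    target = out_dir / (source.stem + ".epub")
    return ["ebook-convert", str(source), str(target)]


def convert_folder(in_dir: Path, out_dir: Path) -> int:
    """Convert every .docx in in_dir; return how many files were converted."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is Calibre installed and on PATH?")
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for source in sorted(in_dir.glob("*.docx")):
        # check=True stops the batch if a conversion fails.
        subprocess.run(build_command(source, out_dir), check=True)
        count += 1
    return count
```

Calling `convert_folder(Path("cleaned_docx"), Path("epubs"))` with your own folder names (these two are placeholders) would write one .epub per .docx. It does nothing about OCR errors or formatting, so the proofreading in Word still has to happen first.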

Is it going to be this hard for all my books?

I have hundreds of somewhat rare books that are not in print anymore and I'm trying to preserve them. I want epub because that typically maintains the right size depending on whatever reader I'm using. I usually use Google Play reader or Kindle. The advantage of having epub is that Google Play will read the text from an epub to me and I commute 2 hours a day to work so I like to have a book to read (listen to) while I'm driving.

Anyway, thank you again for your advice. I'm really hoping someone has some new ideas that work universally in a Windows environment. I have Windows 8.1 64-bit.

If you respond back asking why I didn't try LibreOffice or one of the other solutions offered: I looked at those solutions and they did not appear to offer any advantage over MS Word. And the other suggestion was Linux-based software, and I'm sorry, but if you have Windows-based software I would really appreciate it.

Re: Converting PDF to ePub Calibre not working

Posted: 28 Jun 2018, 01:31
by L.Willms
zbgns wrote: 26 Jun 2018, 03:22 You may use Acrobat Pro to export the pdf to an MS Word file (Save -> Export to -> Microsoft Word).
Can't the regular, non-Pro Adobe Reader export the text as well?

Re: Converting PDF to ePub Calibre not working

Posted: 28 Jun 2018, 01:53
by L.Willms
BryanMatt wrote: 25 Jun 2018, 22:24 Then I converted those files into a pdf with Adobe Acrobat Pro 11 and ran OCR.
So you have Acrobat Pro, which means you can export the whole text into an MS Word document (Acrobat 11 should know DOCX), and use this to build an EPUB. It might be that Acrobat Pro 11 can even export to ODT, the text format of OpenOffice and LibreOffice.

LibreOffice is recommended because the writer2epub add-on is written for OpenOffice and its derivatives, and works well. LibreOffice is free, and so is writer2epub.

It might be possible to port writer2epub to MS Office, but I refrained from the work because LibreOffice (and OpenOffice) can open DOCX files without any problem. There might also exist similar features for MS Office, but I don't know them.
zbgns wrote: Newer versions of Acrobat Pro join lines into paragraphs and in general try to preserve the layout of the original document (as I remember, Acrobat 11 Pro does). So usually no additional conversion of line breaks into spaces is necessary (apart from fixing typical OCR errors).
Why do you think that this is so hard to do?

Re: HELP this is much more difficult than I thought it should be

Posted: 28 Jun 2018, 04:52
by zbgns
BryanMatt wrote: 28 Jun 2018, 00:37 Well, I thank you for all of your advice. It kind of appears to be a helluva lot of work to create an epub because the text needs to be basically rebuilt, or rewritten into MS Word or some other word processor and then converted to an epub.
You are fully right. It takes a lot of work to convert a pdf into an epub and get a fully satisfactory result. Calibre is simply not the right tool for this. The easiest way is to use ABBYY FineReader, as it can convert images directly to epub format. You should probably try this. The code inside epubs built this way seems to be quite messy, but I think the majority of people don't care much about that.
BryanMatt wrote: 28 Jun 2018, 00:37 If you respond back asking why I didn't try LibreOffice or one of the other solutions offered: I looked at those solutions and they did not appear to offer any advantage over MS Word. And the other suggestion was Linux-based software, and I'm sorry, but if you have Windows-based software I would really appreciate it.
There was no Linux-based suggestion in my reply at all. Acrobat Pro and MS Word are not available under Linux. LibreOffice is the same under Windows and Linux and uses the same extensions, so you can also use the Writer2epub LibreOffice extension that L.Willms recommended under Windows.
gImageReader also has Windows builds. You can choose among several variants: https://github.com/manisandro/gImageReader/releases