PDF to text recommendations ?

Convert page images into searchable text. Talk about software, techniques, and new developments here.

Moderator: peterZ

Rime

PDF to text recommendations ?

Post by Rime »

I read a lot of older science fiction and fantasy, and much of it was printed on high-acid paper that is turning yellow and brittle. My evil siblings got me a Kindle for my birthday, and since then I have become a convert to the electronic side.

So, I recently sorted through my collection, picked out all of the duplicates, and sent them to 1DollarScan. They scanned the books and sent me the PDFs (the scanning is destructive; I'm going to build my own scanner at some point so I can deal with the books that are irreplaceable). My goal is to convert them all to straight ASCII so that I can use them on any reader.

After two chapters of OCR conversion with FreeOCR, I have concluded that what I really need is something that can slurp in the whole-book PDFs that 1DollarScan sent me, so that I don't have to run OCR page by page and can just focus on copy editing the OCR'd results. Any recommendations? From what I've read here, it looks like a lot of people are using ABBYY FineReader. Is it suitable? Any other recommendations?

2 chapters down, 14 chapters and 21 books to go...

Thanks !
Anonymous2
Posts: 97
Joined: 18 Oct 2011, 16:05

Re: PDF to text recommendations ?

Post by Anonymous2 »

I'm cheap and don't like payware software, so I use freeware whenever possible.

Tesseract is a free, open-source OCR engine whose development Google sponsors, and it is amazing. It is a command-line program, so it isn't for the faint of heart (I run Linux and prefer CLI over GUI most of the time). There are graphical frontends to it, though, so don't worry.

One problem could be your PDFs, sadly. Are the pages bitonal (purely black and white), or are they raw grayscale or color scans? Their quality might be an issue as well. Tesseract works best when it receives crisp, clear bitonal text from a scanned page (usually after the scan has been cleaned up in Scan Tailor, another free program).

If you want the best accuracy, I suggest you scan the books yourself with a build of your own and process them. I got up to 2,400 pages per hour on my single-camera build (it was literally a DSLR screwed into a shelf), so the scanning shouldn't take you long. And the processing is really simple once it is set up.
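For anyone who wants to script the "whole book in, text out" workflow described above, here is a rough sketch in Python. It assumes the `pdftoppm` tool (from poppler-utils) and the `tesseract` command are installed and on the PATH; the function names, the 300 dpi setting, and the `book.txt` output name are my own choices for illustration, not anything the tools require. It skips the Scan Tailor cleanup step entirely, so expect worse accuracy on yellowed pages.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf: Path, out_prefix: Path, dpi: int = 300) -> list[str]:
    # pdftoppm writes one PNG per page, named <out_prefix>-<page number>.png
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def ocr_cmd(image: Path, lang: str = "eng") -> list[str]:
    # "tesseract <image> <output base>" writes the text to <output base>.txt
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def ocr_book(pdf: Path, workdir: Path) -> Path:
    """Rasterize every page of the PDF, OCR each page, and join the results."""
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, workdir / "page"), check=True)

    # pdftoppm zero-pads the page numbers, so a plain sort keeps page order.
    pages = sorted(workdir.glob("page-*.png"))
    for image in pages:
        subprocess.run(ocr_cmd(image), check=True)

    book = workdir / "book.txt"
    with book.open("w", encoding="utf-8") as out:
        for image in pages:
            out.write(image.with_suffix(".txt").read_text(encoding="utf-8"))
            out.write("\n")
    return book
```

Calling `ocr_book(Path("mybook.pdf"), Path("work"))` would leave a single `work/book.txt` to copy edit. The two `*_cmd` helpers only build the command lines, so you can print them and try them by hand first.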
quân

Re: PDF to text recommendations ?

Post by quân »

You can use VietOCR, which accepts PDF in addition to other common image formats. Try to limit each PDF to 40 pages or fewer, or you will get out-of-memory exceptions. The program has a built-in tool to split a large PDF into smaller files.
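If you would rather plan the splits yourself, the page ranges are easy to work out. This is only a small sketch of the arithmetic (the function name is made up, and it does not touch the PDF or call VietOCR); the 40-page limit is the one mentioned above.

```python
def page_chunks(total_pages: int, max_pages: int = 40) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (first, last) page ranges of at most max_pages."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages must be >= 1")
    return [
        (first, min(first + max_pages - 1, total_pages))
        for first in range(1, total_pages + 1, max_pages)
    ]


# A 100-page book becomes three files: pages 1-40, 41-80, and 81-100.
print(page_chunks(100))
```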
jbrewster

Re: PDF to text recommendations ?

Post by jbrewster »

There are document grabber tools you can get online that let you grab whatever text is in your PDF file and save it either as a doc file or a txt file. But I am not sure they work with scanned pages.

I am pretty sure that such a tool can handle a PDF whose pages were printed to it (so the text itself is stored in the file), but not a scan of a page saved as a PDF, which is only an image.
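One way to check which kind of PDF you have is to try pulling text out of the first few pages and see whether anything comes back. The sketch below assumes the `pdftotext` tool from poppler-utils is installed; the function names and the 50-character threshold are arbitrary choices of mine. Note that some scanning services embed a hidden OCR text layer, so a "yes" here does not mean the text is any good.

```python
import subprocess


def extract_text(pdf_path: str, first_page: int = 1, last_page: int = 5) -> str:
    # "pdftotext -f N -l M file.pdf -" prints the text of pages N..M to stdout.
    result = subprocess.run(
        ["pdftotext", "-f", str(first_page), "-l", str(last_page), pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def has_text_layer(extracted: str, min_chars: int = 50) -> bool:
    # An image-only PDF usually yields little more than page-break characters,
    # so count letters and digits rather than the raw length.
    return sum(ch.isalnum() for ch in extracted) >= min_chars
```

If `has_text_layer(extract_text("mybook.pdf"))` is false, the file is almost certainly just page images and needs real OCR.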
leiyduo
Posts: 2
Joined: 06 Mar 2012, 02:41
E-book readers owned: iPad
Number of books owned: 10
Country: USA

Re: PDF to text recommendations ?

Post by leiyduo »

I've heard that such a conversion needs some special software, right? Or some file upload websites can do it too, such as issuu.com.
chrisgage
Posts: 10
Joined: 28 Mar 2012, 15:02
E-book readers owned: Kindle
Number of books owned: 1000
Country: Switzerland
Location: Lausanne, Switzerland

Re: PDF to text recommendations ?

Post by chrisgage »

I use Nuance Omnipage 17 to scan documents for my collection http://www.ibiblio.org/britishraj -- most of which come from Archive.org.

Omnipage isn't perfect and it is very slow unless you have a fast multi-core PC. I have used it for more than 10 years and am used to its "quirks". It crashes if you try to do too much work in one operation, but in most cases it can recover everything on restart without having to start again.

Frankly with OCR everything depends on
-- the quality of the original printing of the source document
-- the condition of the source document
-- the care taken when scanning/photographing the book.

If all of the above are perfect, Omnipage or ABBYY can achieve almost perfect results. The software will be unsuccessful if any of the above items are poor. Results are always worse with tables or multiple columns per page.

I rarely use the "advanced" features such as training files, as they are slow to create and often only apply to the document you are currently doing. And I ALWAYS save in flat text, and then I paste the txt file as unformatted Unicode into Microsoft Word 2010. The "exact image" mode offered by OCR software is a sorry mess of separately placed text boxes which is completely uneditable, and BELIEVE ME you will need to do a fair amount of editing if you want your resulting document to be correct.
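Part of the editing on flat text output can be automated before the manual pass. As a rough sketch (my own helper, not a feature of Omnipage or ABBYY), the Python below rejoins words that were hyphenated across a line break and unwraps the hard line breaks inside each paragraph, assuming paragraphs are separated by blank lines. It will also wrongly join real compounds that happened to break at the hyphen (for example "well-" / "known"), so the result still needs proofreading.

```python
import re


def reflow(ocr_text: str) -> str:
    """Rejoin line-break hyphenation and unwrap lines within paragraphs."""
    # "conver-\nsion" -> "conversion"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", ocr_text)
    # Blank lines separate paragraphs; single line breaks are just wrapping.
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


sample = "The conver-\nsion was\nslow.\n\nNext para-\ngraph."
print(reflow(sample))
```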