PDF autoflow, or "poor man OCR", or "OWR"

Convert page images into searchable text. Talk about software, techniques, and new developments here.

Moderator: peterZ

jumpjack
Posts: 21
Joined: 04 Mar 2014, 00:53

PDF autoflow, or "poor man OCR", or "OWR"

Post by jumpjack »

I just had an idea for getting auto-flow (reflowable text) in ebooks scanned to PDF or image format.
OWR stands for "Optical Word Recognition", as opposed to "Optical Character Recognition".
OCR is tricky, complex and difficult to implement, and it never gives 100% accurate results. But being able to reflow text is one of the biggest advantages of electronic text over printed text (besides searchability, of course).
So why not implement text autoflow for scanned texts too?
I think it's possible: rather than having the software look for single characters (often the most difficult part, since parts of proportionally spaced characters can overlap their neighbours), it could just look for whole words, which are much easier to tell apart from one another!
And since most word processors justify text without splitting words, there would be no difference between text justified by a word processor and a series of word images justified like text.

I wonder how difficult it would be to write software that looks for words in a scanned page and produces a "refurbished" PDF page.
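
To make the reflow step concrete, here is a minimal sketch that assumes the words have already been cut out as images and only their pixel sizes are known. The names `WordImage` and `reflow` and the `space` and `line_gap` parameters are invented for illustration; baseline alignment and justification are ignored.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordImage:
    width: int   # pixel width of the cropped word image
    height: int  # pixel height of the cropped word image


def reflow(words: List[WordImage], page_width: int,
           space: int = 8, line_gap: int = 6) -> List[Tuple[int, int]]:
    """Greedy line wrap: return the (x, y) top-left corner for each word."""
    positions = []
    x, y, line_height = 0, 0, 0
    for word in words:
        # Start a new line when the word would overflow the page,
        # unless it is the first word on the line (it must go somewhere).
        if x > 0 and x + word.width > page_width:
            x = 0
            y += line_height + line_gap
            line_height = 0
        positions.append((x, y))
        x += word.width + space
        line_height = max(line_height, word.height)
    return positions


if __name__ == "__main__":
    words = [WordImage(40, 10), WordImage(50, 12), WordImage(30, 10)]
    print(reflow(words, page_width=100))
```

Words are never split, so the result behaves like the output of a word processor that wraps whole words only.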
Gerard
Posts: 154
Joined: 17 Oct 2010, 07:15
Number of books owned: 0
Location: Berlin (Germany)

Re: PDF autoflow, or "poor man OCR", or "OWR"

Post by Gerard »

Hi,
we already have open-source software that can do OCR (even if it is not 100% correct). After the OCR you can simply re-wrap the recognized text and map the image data of each word to its new position.
Even if the OCR makes a lot of mistakes, that is not important for wrapping a line.
Maybe it would be a nice Google Summer of Code project.
jumpjack
Posts: 21
Joined: 04 Mar 2014, 00:53

Re: PDF autoflow, or "poor man OCR", or "OWR"

Post by jumpjack »

Thanks, but I have no interest in wrapping a wrong text... I just need wrapping, no OCR. OCR is very unhelpful on Italian texts.
rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: PDF autoflow, or "poor man OCR", or "OWR"

Post by rob »

jumpjack wrote:it could just look for single words, much easier to distinguish one from the other!
I'm not saying it wouldn't work. It could be a good area to research, but the accepted wisdom is that it's easier and more effective to recognize letters and then use a dictionary to correct any mistakes. Most likely the human brain also recognizes individual letters, with something similar to dictionary lookup: at the lower layer, letters are recognized, and at the next few higher layers, valid combinations are recognized, and the combinations which are more likely will feed back down to the letter recognizers to inhibit the wrong letters. Not that it's the most efficient or effective way to do it, but at least we know it works ;)

Unless what you're saying is to locate words rather than recognize them?
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.
Moonboy242
Posts: 56
Joined: 22 Aug 2010, 18:09
E-book readers owned: iPad, Netbook
Number of books owned: 1000

Re: PDF autoflow, or "poor man OCR", or "OWR"

Post by Moonboy242 »

http://www.microsoft.com/typography/ctf ... ition.aspx

Kinda weird coming from Microshaft! :o
iPad: Over it. Android FTW.
jumpjack
Posts: 21
Joined: 04 Mar 2014, 00:53

Re: PDF autoflow, or "poor man OCR", or "OWR"

Post by jumpjack »

rob wrote:
Unless what you're saying is to locate words rather than recognize them?
I think that's a better way to define what I'd like to do.
Locate, extract, save to single files.
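
A minimal sketch of the "locate" step, assuming the page has already been binarized and cut into single text lines, each given as a list of pixel rows with 1 for ink and 0 for background. The function name `find_word_spans` and the `min_gap` threshold are made up for illustration; a real scan would also need deskewing and noise removal first.

```python
from typing import List, Tuple


def find_word_spans(line: List[List[int]], min_gap: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges of words in one text line.

    A run of at least `min_gap` blank columns is treated as a word space;
    shorter blank runs are assumed to be gaps between letters.
    """
    if not line or not line[0]:
        return []
    width = len(line[0])
    # Column projection: does any row have ink in this column?
    ink = [any(row[c] for row in line) for c in range(width)]

    spans = []
    start = None      # first ink column of the current word
    last_ink = None   # most recent ink column seen
    for c, has_ink in enumerate(ink):
        if not has_ink:
            continue
        if start is None:
            start = c
        elif c - last_ink - 1 >= min_gap:
            spans.append((start, last_ink + 1))  # end is exclusive
            start = c
        last_ink = c
    if start is not None:
        spans.append((start, last_ink + 1))
    return spans


if __name__ == "__main__":
    line = [[1, 1, 0, 1, 0, 0, 0, 1, 1, 0]]
    print(find_word_spans(line, min_gap=3))
```

Each returned span can then be cropped out of the line image and saved as its own file.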
Gerard
Posts: 154
Joined: 17 Oct 2010, 07:15
Number of books owned: 0
Location: Berlin (Germany)

Re: PDF autoflow, or "poor man OCR", or "OWR"

Post by Gerard »

jumpjack wrote:Thanks, but I have no interest in wrapping a wrong text... I just need wrapping, no OCR. OCR is very unhelpful on Italian texts.
Maybe I didn't explain it well enough.
For example, when you use OCR software you can just use the text output (this is not what you want, since this text output has errors).
Another way to use OCR software is to leave the image layer intact and put the text layer behind it. When you look at the output you see the exact scanned page, but you can select the text with your mouse, because behind the words in the image there is a hidden text layer. The selected and copied text can contain the usual OCR errors.
With this hidden text layer behind the image you are also able to search the document.

Even if the OCR is bad, it is an advantage as long as your software (Acrobat or a DjVu reader) can handle this format.

This software is already available, and the text recognized by the OCR software also carries the information about where each word was in the image.
The OCR software already gives you all the information you need; you just have to use it in a different way.

You are free to delete the text layer after the wrapping. The point is that OWR is easier to implement (in my opinion), but someone needs to program it. Be realistic: unless you work on it full time, you will not get even close to what is already possible with OCR. Using bad OCR to get good OWR and then reflowing the image is much more efficient.
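
As a rough illustration of using only the positions and throwing the recognized text away, the sketch below assumes the OCR engine can write hOCR output, where each word is an element with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`. The helper name `word_boxes` and the sample markup are hand-written for this example.

```python
from html.parser import HTMLParser
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels


class _WordBoxParser(HTMLParser):
    """Collect the bounding box of every hOCR word element."""

    def __init__(self) -> None:
        super().__init__()
        self.boxes: List[Box] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if "ocrx_word" not in (attr.get("class") or "").split():
            return
        # The title holds ';'-separated properties, e.g. "bbox ...; x_wconf 95".
        for field in (attr.get("title") or "").split(";"):
            parts = field.split()
            if len(parts) == 5 and parts[0] == "bbox":
                self.boxes.append(tuple(int(p) for p in parts[1:]))


def word_boxes(hocr: str) -> List[Box]:
    """Return word bounding boxes in document order, ignoring the text."""
    parser = _WordBoxParser()
    parser.feed(hocr)
    return parser.boxes


if __name__ == "__main__":
    sample = (
        "<span class='ocr_line' title='bbox 36 92 150 116'>"
        "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Tbe</span> "
        "<span class='ocrx_word' title='bbox 104 92 150 116; x_wconf 60'>cat</span>"
        "</span>"
    )
    print(word_boxes(sample))
```

The misrecognized "Tbe" in the sample does no harm here: only the boxes are used to crop the word images, so the wrong text never reaches the reader.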
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: PDF autoflow, or "poor man OCR", or "OWR"

Post by daniel_reetz »

Yeah, I think I get the idea -- break the words into image chunks on a grid, and then shift them around as necessary to fit on a reduced screen. It's interesting... if you could reliably recognize word-chunks, you could treat them as glyphs in an index and do compression that way.
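
A toy sketch of the glyph-index idea, assuming each word chunk is a small binary bitmap given as a list of pixel rows. The function name `build_index` is invented here, and the matching is exact; real scans would need a fuzzy comparison, since two scans of the same word never match pixel for pixel.

```python
from typing import Dict, List, Tuple

Bitmap = List[List[int]]


def build_index(word_bitmaps: List[Bitmap]) -> Tuple[List[Bitmap], List[int]]:
    """Store each distinct word bitmap once and refer to it by number.

    Returns (dictionary, refs): `dictionary` holds the unique bitmaps and
    `refs[i]` is the dictionary position of the i-th word on the page.
    """
    dictionary: List[Bitmap] = []
    seen: Dict[tuple, int] = {}
    refs: List[int] = []
    for bitmap in word_bitmaps:
        key = tuple(tuple(row) for row in bitmap)  # hashable copy of the pixels
        if key not in seen:
            seen[key] = len(dictionary)
            dictionary.append(bitmap)
        refs.append(seen[key])
    return dictionary, refs


if __name__ == "__main__":
    the = [[1, 0], [0, 1]]
    cat = [[1, 1]]
    dictionary, refs = build_index([the, cat, the])
    print(len(dictionary), refs)
```

The page is then just the list of references plus positions, and a repeated word costs one small integer instead of a second copy of its image.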
Anonymous1

Re: PDF autoflow, or "poor man OCR", or "OWR"

Post by Anonymous1 »

This would be a pretty good idea, and Tulon's coupled snakes algorithm can detect word "shapes" quite well. All you would have to do is detect punctuation and spaces, which isn't too hard either. Go for it.
Anonymous1

Re: PDF autoflow, or "poor man OCR", or "OWR"

Post by Anonymous1 »

Gerard wrote:
jumpjack wrote:Thanks, but I have no interest in wrapping a wrong text... I just need wrapping, no OCR. OCR is very unhelpful on Italian texts.
With this hidden text layer behind the image you are also able to search in the document

even if ocr is bad, as long as your software (Acrobat or a DjVu reader) can handle this format it is an advantage
This is exactly what djvubind does, but it does have its hiccups when you're translating English text with phonetic Russian words. I've never read something so interesting, as Tesseract repeatedly "reads" the same words incorrectly.

But yes, the hidden layer is pretty useful too. It takes a lot longer to process than just compressing images (I can do a 300 page book in around 5 minutes. The same book was going for 10 hours with OCR, and even then I had to split it up into 10 page chunks due to the program crashing).