PDF autoflow, or "poor man OCR", or "OWR"
Moderator: peterZ
I just had an idea to get auto-flow for ebooks scanned in PDF or image format.
OWR stands for "Optical Word Recognition", as opposed to "Optical Character Recognition".
The second one is very tricky, complex and difficult to implement, and it never gives 100% accurate results. But being able to reflow text is one of the biggest advantages of having electronic rather than printed text (besides searchability, of course).
So, why not implement text autoflow in scanned texts too?
I think it's possible: rather than having the software look for single characters (often the most difficult part, as some parts of proportional characters can overlap other characters), it could just look for single words, which are much easier to distinguish from one another!
And, as most word processors justify text without splitting words, there would be no difference between text justified by a word processor and a series of word images justified like text.
I wonder how difficult it would be to write software that looks for words in a scanned page and produces a "refurbished PDF page".
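To make the idea a bit more concrete, here is a minimal sketch of locating (not recognizing) words on a single text line by looking for runs of blank columns. It assumes the line has already been isolated and binarized into a list of rows, with 1 for ink and 0 for paper. The names `find_word_spans` and `min_gap` are made up for this illustration, and a real scan would also need deskewing, line segmentation and noise handling.

```python
def column_has_ink(line_img, x):
    """True if any pixel in column x of the text line is ink (non-zero)."""
    return any(row[x] for row in line_img)


def find_word_spans(line_img, min_gap=3):
    """Return (start, end) column ranges of words, end exclusive.

    A run of at least `min_gap` blank columns is treated as a space
    between words; shorter blank runs are assumed to be gaps between
    letters of the same word. `min_gap` is an assumed tuning value.
    """
    width = len(line_img[0]) if line_img else 0
    spans = []
    start = None    # first column of the word being scanned, if any
    blank_run = 0   # consecutive blank columns seen inside that word
    for x in range(width):
        if column_has_ink(line_img, x):
            if start is None:
                start = x
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gap:
                # The word ended where the blank run began.
                spans.append((start, x - blank_run + 1))
                start = None
                blank_run = 0
    if start is not None:
        # Close the last word, dropping any trailing blank columns.
        spans.append((start, width - blank_run))
    return spans


# Tiny one-row "scan": two words separated by four blank columns.
line = [[1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0]]
print(find_word_spans(line, min_gap=3))
```

The single blank column inside the first word is shorter than `min_gap`, so it is kept as part of that word.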
Re: PDF autoflow, or "poor man OCR", or "OWR"
Hi,
We already have open source software that can do OCR (even if it is not 100% correct). After the OCR you can simply wrap the recognized text and map the image data of each word to its new position.
Even if the OCR makes a lot of mistakes, that is not important for wrapping a line.
Maybe it would be a nice Google Summer of Code project.
Re: PDF autoflow, or "poor man OCR", or "OWR"
Thanks, but I have no interest in wrapping a wrong text... I just need wrapping, no OCR. OCR is very unhelpful on Italian texts.
- rob
- Posts: 773
- Joined: 03 Jun 2009, 13:50
- E-book readers owned: iRex iLiad, Kindle 2
- Number of books owned: 4000
- Country: United States
- Location: Maryland, United States
- Contact:
Re: PDF autoflow, or "poor man OCR", or "OWR"
jumpjack wrote: it could just look for single words, much easier to distinguish one from the other!
I'm not saying it wouldn't work. It could be a good area to research, but the accepted wisdom is that it's easier and more effective to recognize letters and then use a dictionary to correct any mistakes. Most likely the human brain also recognizes individual letters, with something similar to dictionary lookup: at the lower layer, letters are recognized, and at the next few higher layers, valid combinations are recognized, and the combinations which are more likely will feed back down to the letter recognizers to inhibit the wrong letters. Not that it's the most efficient or effective way to do it, but at least we know it works.
Unless what you're saying is to locate words rather than recognize them?
- Moonboy242
- Posts: 56
- Joined: 22 Aug 2010, 18:09
- E-book readers owned: iPad, Netbook
- Number of books owned: 1000
Re: PDF autoflow, or "poor man OCR", or "OWR"
iPad: Over it. Android FTW.
Re: PDF autoflow, or "poor man OCR", or "OWR"
rob wrote: Unless what you're saying is to locate words rather than recognize them?
I think that's a better way to describe what I'd like to get working.
Locate, extract, save to single files.
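As a rough illustration of the "extract, save to single files" part, the sketch below crops each word out of a text line and writes it as a plain-text PBM file (the `P1` format, where 1 means black). It assumes the word column spans have already been found somehow and that the line is a list of rows of 0/1 pixels. The function names and the `word_0000.pbm` naming scheme are invented for this example.

```python
import os
import tempfile


def crop_columns(line_img, start, end):
    """Cut columns start..end (end exclusive) out of every row."""
    return [row[start:end] for row in line_img]


def save_pbm(img, path):
    """Write a binary image as plain-text PBM (P1); 1 means black ink."""
    height = len(img)
    width = len(img[0]) if img else 0
    with open(path, "w", encoding="ascii") as f:
        f.write(f"P1\n{width} {height}\n")
        for row in img:
            f.write(" ".join(str(int(bool(p))) for p in row) + "\n")


def extract_words(line_img, spans, out_dir):
    """Save one PBM file per word span and return the file paths."""
    paths = []
    for i, (start, end) in enumerate(spans):
        path = os.path.join(out_dir, f"word_{i:04d}.pbm")
        save_pbm(crop_columns(line_img, start, end), path)
        paths.append(path)
    return paths


# Two-row line with a 2-pixel-wide word and a 1-pixel-wide word.
line = [[1, 1, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 1]]
with tempfile.TemporaryDirectory() as out_dir:
    for p in extract_words(line, [(0, 2), (5, 6)], out_dir):
        print(os.path.basename(p))
```

PBM is used here only because it can be written without any imaging library; any image format would do.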
Re: PDF autoflow, or "poor man OCR", or "OWR"
jumpjack wrote: Thanks, but I have no interest in wrapping a wrong text... I just need wrapping, no OCR. OCR is very unhelpful on Italian texts.
Maybe I didn't explain it well enough.
E.g. when you use OCR software you can just use the text output (this is not what you want, because this text output has errors).
Another way to use OCR software is to leave the image layer intact and put the text layer behind it. When you look at the output you see the exact scanned page, but you can select the text with your mouse: behind the words in the image is a hidden text layer. The selected and copied text can contain the usual OCR errors.
With this hidden text layer behind the image you are also able to search in the document.
Even if the OCR is bad, as long as your software (Acrobat or a DjVu reader) can handle this format, it is an advantage.
This software is already available, and the text recognized by the OCR software also carries the information about where each word was in the image.
The OCR software already gives you all the information you need; you just have to use it in another way.
You are free to delete the text layer after the wrapping. The point is that OWR is easier to implement (in my opinion), but someone needs to program it. Be realistic: unless you work on it full time, you will not get even close to what is already possible with OCR. Using bad OCR to get good OWR and then reflowing the image is much more efficient.
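The reflow step itself only needs the size of each word box, wherever the boxes come from. Here is a minimal sketch, assuming each word is given as a (width, height) pair in pixels and in reading order. The function name `reflow` and the `space` and `line_gap` parameters are hypothetical, and no particular OCR engine's output format is implied.

```python
def reflow(word_sizes, page_width, space=4, line_gap=2):
    """Greedy line wrapping of word images.

    word_sizes: list of (width, height) tuples in reading order.
    Returns the (x, y) top-left position of each word on a page that
    is `page_width` pixels wide. `space` and `line_gap` are assumed
    values for the gap between words and between lines.
    """
    positions = []
    x = y = 0
    line_height = 0
    for w, h in word_sizes:
        # Start a new line if the word does not fit. A word wider than
        # the page still gets a line of its own.
        if x > 0 and x + w > page_width:
            x = 0
            y += line_height + line_gap
            line_height = 0
        positions.append((x, y))
        x += w + space
        line_height = max(line_height, h)
    return positions


words = [(30, 10), (40, 10), (50, 12), (20, 10)]
print(reflow(words, page_width=80))
```

This places words by their top edge, which is the simplest choice. Aligning on the text baseline would look better but needs a baseline estimate for each word image.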
- daniel_reetz
- Posts: 2812
- Joined: 03 Jun 2009, 13:56
- E-book readers owned: Used to have a PRS-500
- Number of books owned: 600
- Country: United States
- Contact:
Re: PDF autoflow, or "poor man OCR", or "OWR"
Yeah, I think I get the idea -- break the words into image chunks on a grid, and then shift them around as necessary to fit on a reduced screen. It's interesting... if you could reliably recognize word-chunks, you could treat them as glyphs in an index and do compression that way.
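The "glyphs in an index" idea could look something like the sketch below: store each distinct word image once and represent the page as a list of index numbers. It assumes word images are small lists of rows of 0/1 pixels and only merges exact duplicates. Real scans would need fuzzy matching, since two prints of the same word never scan identically. `build_word_index` is a made-up name.

```python
def build_word_index(word_images):
    """Deduplicate word images.

    Returns (glyphs, refs): `glyphs` holds each distinct image once,
    and `refs` holds, for every input image, its position in `glyphs`.
    """
    index = {}   # hashable copy of an image -> position in glyphs
    glyphs = []
    refs = []
    for img in word_images:
        key = tuple(tuple(row) for row in img)
        if key not in index:
            index[key] = len(glyphs)
            glyphs.append(img)
        refs.append(index[key])
    return glyphs, refs


the = [[1, 0], [1, 1]]
cat = [[0, 1], [1, 1]]
glyphs, refs = build_word_index([the, cat, the])
print(len(glyphs), refs)
```

The saving comes from frequent words: each repeat costs one integer instead of a whole image.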
Re: PDF autoflow, or "poor man OCR", or "OWR"
This would be a pretty good idea, and Tulon's coupled snakes algorithm can detect word "shapes" quite well. All you would have to do is detect punctuation and spaces, which isn't too hard either. Go for it.
Re: PDF autoflow, or "poor man OCR", or "OWR"
jumpjack wrote: Thanks, but I have no interest in wrapping a wrong text... I just need wrapping, no OCR. OCR is very unhelpful on Italian texts.
Gerard wrote: With this hidden text layer behind the image you are also able to search in the document. Even if the OCR is bad, as long as your software (Acrobat or a DjVu reader) can handle this format, it is an advantage.
This is exactly what djvubind does, but it does have its hiccups when you're translating English text with phonetic Russian words. I've never read something so interesting, as Tesseract repeatedly "reads" the same words incorrectly.
But yes, the hidden layer is pretty useful too. It takes a lot longer to process than just compressing images (I can do a 300 page book in around 5 minutes. The same book was going for 10 hours with OCR, and even then I had to split it up into 10 page chunks due to the program crashing).