How to make font from scan (clearscan alternative) and please post examples of captured images

foler · Post by **foler** » 20 Apr 2016, 14:32

1. ClearScan in Adobe Acrobat converts a scan to a font at exactly the same positions by building a custom vector font on the fly. Is there any alternative on the market? This is a very interesting feature for us because we can extract the position of each glyph for marking and searching text.

2. Can someone post some images captured with digital cameras? I want to compare them to scans from a copier.

Thanks

duerig · Post by **duerig** » 20 Apr 2016, 14:43

(1) I don't know any alternatives to Adobe Acrobat's particular method. The common alternative is to either just read a (cleaned up) image or OCR the image and read the resulting text. I've used Clearscan myself and while I like the idea of it, I haven't found it to be a big improvement.

(2) I think you really want to capture the same page both ways to get a good comparison. But take a look at these samples if you want to try some post-processing techniques to see how they work:

http://tenrec.builders/quill/samples/sample-1.jpg
http://tenrec.builders/quill/samples/sample-2.jpg
http://tenrec.builders/quill/samples/sample-3.jpg

-D

duerig · Post by **duerig** » 20 Apr 2016, 14:57

Oh. In case you didn't know, there is a common technique (especially for DJVU files, but I think you can do it for PDF too) where you both OCR the page and keep the image. The OCR creates a hidden background layer which you can use for text selection and search. Because you read the foreground image, there will never be any formatting errors during normal reading, and if there is an occasional OCR glitch you won't be distracted by it.

So not exactly Clearscan, but it can get you similar results. The program I have used for this most often is djvubind which generates a DJVU file. There is a similar one for PDF called pdfbeads, but I haven't had much luck with it when I tried it. There may be others around as well.
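For the PDF case, the hidden layer is usually just ordinary text drawn with text rendering mode 3 (neither filled nor stroked), positioned over the matching part of the page image. Here is a minimal sketch of generating those content-stream operators. The input format (word text plus a pixel bounding box with a top-left origin), the function names, and the `/F1` font name are assumptions for illustration; this is not how djvubind or pdfbeads are implemented.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(words, page_height_px, dpi=300, font="/F1"):
    """Build PDF content-stream operators that place OCR words invisibly.

    words: iterable of (text, left, top, right, bottom) in image pixels,
           origin at the top-left corner (a hypothetical OCR output format).
    font:  resource name of a font assumed to be declared on the page.
    """
    scale = 72.0 / dpi  # default PDF user space has 72 units per inch
    lines = ["BT", "3 Tr"]  # rendering mode 3: text is neither filled nor stroked
    for text, left, top, right, bottom in words:
        size = (bottom - top) * scale           # box height used as font size
        x = left * scale
        y = (page_height_px - bottom) * scale   # PDF origin is bottom-left
        lines.append(f"{font} {size:.2f} Tf")
        lines.append(f"1 0 0 1 {x:.2f} {y:.2f} Tm")
        lines.append(f"({escape_pdf_string(text)}) Tj")
    lines.append("ET")
    return "\n".join(lines)


if __name__ == "__main__":
    # One word on a 3300-pixel-tall page scanned at 300 dpi.
    print(invisible_text_ops([("Hello", 300, 300, 600, 350)], 3300))
```

This only approximates the position: it uses the bottom of the box as the baseline and does not stretch the word to the box width (the `Tz` horizontal scaling operator is typically used for that). Non-ASCII text would also need a suitable font encoding.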

-D

cday · Post by **cday** » 20 Apr 2016, 17:20

foler wrote:1. ClearScan in Adobe Acrobat converts a scan to a font at exactly the same positions by building a custom vector font on the fly. Is there any alternative on the market? This is a very interesting feature for us because we can extract the position of each glyph for marking and searching text.

I don't know of any alternative although it would be interesting if there were one: possibly the leading OCR software providers Abbyy and Nuance will introduce an equivalent technology eventually, but Adobe have a strong background in font development from their early days so they have a significant advantage over other companies.

I've posted several times on ClearScan and you can find my posts from a forum search for 'cday clearscan': probably the most immediately relevant for you is this post. I generally stand by what I have written in the past but can't give a warranty without rereading my early posts...

So no known alternative way of accurately extracting the position of each glyph. The most common way of OCR'ing a camera image is to produce, as duerig mentions, a searchable version of the original bitmap page image, with the advantage that the original image as captured is viewed, and OCR errors within reason are unlikely to be a significant problem. That form is sometimes referred to as 'text under the page image' with the search result effectively in a hidden layer.

An alternative that is less commonly used is to generate a vector text representation of the page, as if the text had been created in a word processor, which results in excellent scalable text quality and a small file size, but is unlikely to accurately reproduce the exact appearance of the original page, and will include any OCR recognition errors unless the output is proofread carefully. That form is sometimes referred to as 'text over the page image' although the file doesn't actually contain the original image.

But neither of those alternatives directly relates to your original question!

foler · Post by **foler** » 21 Apr 2016, 06:30

The most powerful solution for extracting text from a PDF and finding exact glyph/word/line coordinates is the TET plugin and the workflow around it:
https://www.pdflib.com/products/
But for this I need each glyph to be OCR'd at the SAME position. So far, ClearScan is the only tool I know of that allows this.
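To illustrate why per-glyph positions matter for marking and searching, here is a small sketch that turns a list of glyph boxes into highlight rectangles for a search term. The `Glyph` structure and `find_term_boxes` function are hypothetical; they are not the TET output format or API, just an assumed shape for whatever the extraction step produces.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One character with its bounding box (hypothetical structure)."""
    char: str   # exactly one character
    x0: float
    y0: float
    x1: float
    y1: float


def find_term_boxes(glyphs, term):
    """Return one bounding box (x0, y0, x1, y1) per occurrence of term.

    glyphs must be in reading order with one entry per character,
    so that string offsets map directly onto list indices.
    """
    if not term:
        return []
    text = "".join(g.char for g in glyphs)
    boxes = []
    start = text.find(term)
    while start != -1:
        hit = glyphs[start:start + len(term)]
        boxes.append((min(g.x0 for g in hit), min(g.y0 for g in hit),
                      max(g.x1 for g in hit), max(g.y1 for g in hit)))
        start = text.find(term, start + 1)
    return boxes


if __name__ == "__main__":
    line = [Glyph("c", 0, 0, 5, 10), Glyph("a", 5, 0, 10, 10),
            Glyph("t", 10, 0, 15, 10)]
    print(find_term_boxes(line, "at"))
```

A match that wraps across two lines would get one oversized box with this approach, so a real implementation would probably split hits per line.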

cday · Post by **cday** » 21 Apr 2016, 10:02

foler wrote:The most powerful solution for extracting text from a PDF and finding exact glyph/word/line coordinates is the TET plugin and the workflow around it:
https://www.pdflib.com/products/
But for this I need each glyph to be OCR'd at the SAME position. So far, ClearScan is the only tool I know of that allows this.

Are you sure you can't obtain the result you need by using TET directly on the raster (bitmap) images from the camera?

ClearScan vector text should be placed close to the position of the characters on the bitmap image, and the character outlines will be smoothed depending on the quality of the original images and the number of instances of each character processed, but will that actually produce better output than using TET directly on the original images? I'm not sure exactly what output you wish to obtain.

foler · Post by **foler** » 21 Apr 2016, 10:10

We convert scanned PDFs to SWF for our textbook multimedia platform for schools. Currently we use the pdf2swf utility to make the SWF. Text with coordinates is extracted to XML. But this tool sometimes has problems with ClearScan PDFs. The TET plugin has an intelligent algorithm and its XML output is fantastic. It understands words better. With TET you can extract text from a 1500-page book in 5-7 seconds! As far as I know, TET doesn't have OCR.
This is the reason why we need exact positions for glyphs.

cday · Post by **cday** » 21 Apr 2016, 15:37

foler wrote:We convert scanned PDFs to SWF for our textbook multimedia platform for schools. Currently we use the pdf2swf utility to make the SWF. Text with coordinates is extracted to XML. But this tool sometimes has problems with ClearScan PDFs. The TET plugin has an intelligent algorithm and its XML output is fantastic. It understands words better. With TET you can extract text from a 1500-page book in 5-7 seconds! As far as I know, TET doesn't have OCR.
This is the reason why we need exact positions for glyphs.

Your problem is that pdf2swf sometimes doesn't work correctly with ClearScan PDFs? If you could convert those ClearScan PDFs (containing vector text) to PDFs containing bitmap images of the pages, as if they were normal scanned pages, might that provide a practical solution?

I've done a quick test using three pages of a ClearScan PDF book file and successfully converted the file to a three-page PDF bitmap image file and attach the files below; the text in the new PDF isn't searchable if that is a consideration, but it could of course be made searchable using OCR software.
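One possible way to script this kind of conversion (not necessarily the tool used for the test above) is Ghostscript's `pdfimage24` output device, which renders each page to a bitmap and wraps the result in a new PDF. The sketch below assumes Ghostscript is installed and available as `gs` on the PATH (on Windows the executable is usually named differently); the function names and file names are placeholders.

```python
import shutil
import subprocess


def rasterize_command(src, dst, dpi=300):
    """Build a Ghostscript command that re-renders src as image-only pages."""
    return ["gs", "-sDEVICE=pdfimage24", f"-r{dpi}", "-o", dst, src]


def rasterize_pdf(src, dst, dpi=300):
    """Run the conversion; raises if Ghostscript is missing or fails."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) not found on PATH")
    subprocess.run(rasterize_command(src, dst, dpi), check=True)


if __name__ == "__main__":
    # Show the command that would be run for placeholder file names.
    print(" ".join(rasterize_command("clearscan.pdf", "bitmap.pdf")))
```

The output has no text layer at all, so it would need to go through OCR again if search is required, and the resolution chosen here determines both quality and file size.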

DIY Book Scanner
