How to make font from scan (clearscan alternative) and please post examples of captured images

Moderator: peterZ

foler
Posts: 9
Joined: 15 Apr 2016, 06:24
Number of books owned: 0
Country: Croatia

How to make font from scan (clearscan alternative) and please post examples of captured images

Post by foler »

1. ClearScan in Adobe Acrobat converts a scan to a font at exactly the same positions by building a custom vector font on the fly. Is there any alternative on the market? This is a very interesting feature for us because we can extract the position of each glyph for marking and searching text.

2. Can someone post some images captured with digital cameras here? I want to compare them to scans from a copier.

Thanks
duerig
Posts: 388
Joined: 01 Jun 2014, 17:04
Number of books owned: 1000
Country: United States of America

Re: How to make font from scan (clearscan alternative) and please post examples of captured images

Post by duerig »

(1) I don't know any alternatives to Adobe Acrobat's particular method. The common alternative is to either just read a (cleaned up) image or OCR the image and read the resulting text. I've used Clearscan myself and while I like the idea of it, I haven't found it to be a big improvement.

(2) I think you really want to capture the same page both ways to get a good comparison. But take a look at these samples if you want to try some post-processing techniques to see how they work:

http://tenrec.builders/quill/samples/sample-1.jpg
http://tenrec.builders/quill/samples/sample-2.jpg
http://tenrec.builders/quill/samples/sample-3.jpg

-D
duerig
Posts: 388
Joined: 01 Jun 2014, 17:04
Number of books owned: 1000
Country: United States of America

Re: How to make font from scan (clearscan alternative) and please post examples of captured images

Post by duerig »

Oh. In case you didn't know, there is a common technique (especially for DjVu files, but I think you can do it for PDF too) where you do both OCR and read the image. The OCR creates a hidden background layer which you can use for text selection and search, while you read the foreground image itself. That means there will never be any formatting errors during normal reading, and an occasional OCR glitch won't distract you.

So not exactly ClearScan, but it can get you similar results. The program I have used for this most often is djvubind, which generates a DjVu file. There is a similar one for PDF called pdfbeads, but I didn't have much luck with it when I tried it. There may be others around as well.
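A minimal sketch of how the hidden-layer approach exposes positions, assuming the OCR engine can write hOCR (Tesseract can, and pdfbeads reads hOCR files). In hOCR each recognized word is a span of class ocrx_word whose title carries a "bbox x0 y0 x1 y1" entry in image pixels. The class name HocrWordParser, the helper extract_words, and the SAMPLE markup are made up for illustration. Note that this gives word boxes, not the per-glyph placement ClearScan produces.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" entry inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an open word span.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr_text):
    """Return a list of (word, bbox) tuples found in an hOCR document."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Hypothetical hOCR fragment, shaped like typical Tesseract output.
SAMPLE = """<span class="ocr_line" title="bbox 100 200 520 240">
<span class="ocrx_word" title="bbox 100 200 250 240; x_wconf 96">Hello</span>
<span class="ocrx_word" title="bbox 270 200 520 240; x_wconf 91">world</span>
</span>"""

for text, box in extract_words(SAMPLE):
    print(text, box)
```

The boxes are in pixels with the origin at the top-left of the page image, so they still need scaling before they can be compared with PDF coordinates.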

-D
cday
Posts: 456
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: How to make font from scan (clearscan alternative) and please post examples of captured images

Post by cday »

foler wrote:1. ClearScan in Adobe Acrobat converts a scan to a font at exactly the same positions by building a custom vector font on the fly. Is there any alternative on the market? This is a very interesting feature for us because we can extract the position of each glyph for marking and searching text.
I don't know of any alternative, although it would be interesting if there were one. Possibly the leading OCR software providers ABBYY and Nuance will introduce an equivalent technology eventually, but Adobe have a strong background in font development from their early days, so they have a significant advantage over other companies.

I've posted several times on ClearScan and you can find my posts from a forum search for 'cday clearscan': probably the most immediately relevant for you is this post. I generally stand by what I have written in the past but can't give a warranty without rereading my early posts...

So no known alternative way of accurately extracting the position of each glyph. The most common way of OCR'ing a camera image is to produce, as duerig mentions, a searchable version of the original bitmap page image, with the advantage that the original image as captured is viewed, and OCR errors within reason are unlikely to be a significant problem. That form is sometimes referred to as 'text under the page image' with the search result effectively in a hidden layer.

An alternative that is less commonly used is to generate a vector text representation of the page, as if the text had been created in a word processor. That results in excellent scalable text quality and a small file size, but it is unlikely to reproduce the exact appearance of the original page, and it will include any OCR recognition errors unless the output is proofread carefully. That form is sometimes referred to as 'text over the page image' although the file doesn't actually contain the original image.

But neither of those alternatives directly relates to your original question!
foler
Posts: 9
Joined: 15 Apr 2016, 06:24
Number of books owned: 0
Country: Croatia

Re: How to make font from scan (clearscan alternative) and please post examples of captured images

Post by foler »

The most powerful solution for extracting text from a PDF and finding exact glyph/word/line coordinates is the TET plugin and the workflow around it:
https://www.pdflib.com/products/
But for this I need each glyph to be OCR'd at the SAME position. So far the only tool I know that allows this is ClearScan.
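To make "same position" concrete, here is a small sketch of the coordinate conversion involved. It assumes the page image fills the whole PDF page, that the scan resolution is known, and that the PDF uses the default user space (1/72 inch units, origin at the bottom-left). The function name pixel_box_to_pdf_points and the example numbers are hypothetical.

```python
def pixel_box_to_pdf_points(box, image_height_px, dpi):
    """Convert an image-space box to PDF points.

    box is (x0, y0, x1, y1) in pixels with the origin at the top-left.
    The result is (left, bottom, right, top) in points with the origin
    at the bottom-left, so the y axis is flipped.
    """
    x0, y0, x1, y1 = box
    left = x0 * 72.0 / dpi
    right = x1 * 72.0 / dpi
    bottom = (image_height_px - y1) * 72.0 / dpi
    top = (image_height_px - y0) * 72.0 / dpi
    return (left, bottom, right, top)


# Assumed example: a word box on an 11-inch-tall page scanned at 300 dpi.
print(pixel_box_to_pdf_points((300, 600, 900, 660), 3300, 300))
# (72.0, 633.6, 216.0, 648.0)
```

If the extracted text coordinates and the converted OCR boxes agree to within a point or so, highlighting drawn from those coordinates should land on the printed word.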
cday
Posts: 456
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: How to make font from scan (clearscan alternative) and please post examples of captured images

Post by cday »

foler wrote:The most powerful solution for extracting text from a PDF and finding exact glyph/word/line coordinates is the TET plugin and the workflow around it:
https://www.pdflib.com/products/
But for this I need each glyph to be OCR'd at the SAME position. So far the only tool I know that allows this is ClearScan.
Are you sure you can't obtain the result you need by using TET directly on the raster (bitmap) images from the camera?

ClearScan vector text should be placed close to the position of the characters on the bitmap image, and the character outlines will be smoothed depending on the quality of the original images and the number of instances of each character processed, but will that actually produce better output than using TET directly on the original images? I'm not sure exactly what output you wish to obtain.
foler
Posts: 9
Joined: 15 Apr 2016, 06:24
Number of books owned: 0
Country: Croatia

Re: How to make font from scan (clearscan alternative) and please post examples of captured images

Post by foler »

We convert scanned PDFs to SWF for our textbook multimedia platform for schools. We currently use the pdf2swf utility to make the SWF. Text with coordinates is extracted to XML. But this tool sometimes has problems with ClearScan PDFs. The TET plugin has some intelligent algorithm and its XML output is fantastic. It understands words better. With TET you can extract the text from a 1500-page book in 5-7 seconds! As far as I know, TET doesn't have OCR.
This is the reason why we need exact positions for glyphs.
cday
Posts: 456
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: How to make font from scan (clearscan alternative) and please post examples of captured images

Post by cday »

foler wrote:We convert scanned PDFs to SWF for our textbook multimedia platform for schools. We currently use the pdf2swf utility to make the SWF. Text with coordinates is extracted to XML. But this tool sometimes has problems with ClearScan PDFs. The TET plugin has some intelligent algorithm and its XML output is fantastic. It understands words better. With TET you can extract the text from a 1500-page book in 5-7 seconds! As far as I know, TET doesn't have OCR.
This is the reason why we need exact positions for glyphs.
Your problem is that pdf2swf sometimes doesn't work correctly with ClearScan PDFs? If you could convert those ClearScan PDFs (containing vector text) to PDFs containing bitmap images of the pages, as if they were normal scanned pages, might that provide a practical solution?

I've done a quick test using three pages of a ClearScan PDF book file, successfully converted them to a three-page bitmap-image PDF, and attach both files below. The text in the new PDF isn't searchable, if that is a consideration, but it could of course be made searchable using OCR software.
Attachments
ClearScan_3-page_file.pdf
ClearScan_file
(458.04 KiB) Downloaded 494 times
Bitmap_image_3-page_file.pdf
Bitmap_image_file
(256.74 KiB) Downloaded 507 times
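The post doesn't say which tool was used for the conversion. One possible route, sketched here under the assumption that Poppler's pdftoppm and the img2pdf command-line tool are installed, is to render each page to a PNG and wrap the PNGs in a new PDF. The function names, file names, and the 300 dpi default are illustrative only.

```python
import glob
import subprocess


def rasterize_command(pdf_path, prefix, dpi=300):
    """Build the pdftoppm call that renders each page to <prefix>-N.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix]


def rebuild_command(png_paths, out_pdf):
    """Build the img2pdf call that wraps the page images in a new PDF."""
    return ["img2pdf", *png_paths, "-o", out_pdf]


def clearscan_to_bitmap_pdf(pdf_path, out_pdf, prefix="page", dpi=300):
    """Render a vector-text PDF to page images and rebuild it as a bitmap PDF."""
    subprocess.run(rasterize_command(pdf_path, prefix, dpi), check=True)
    # pdftoppm should zero-pad page numbers, so a plain sort keeps page order.
    pages = sorted(glob.glob(prefix + "-*.png"))
    subprocess.run(rebuild_command(pages, out_pdf), check=True)


# Example call (hypothetical file names, needs both tools on the PATH):
# clearscan_to_bitmap_pdf("ClearScan_3-page_file.pdf", "bitmap_pages.pdf")
```

The output has no text layer at all, so it would still need an OCR pass if search is required.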