Announcing smoothscan: Free software scan vectorizing tool

ncraun · Post by **ncraun** » 23 Aug 2013, 17:11

Hello everyone, I have written a tool to convert scanned document pages to a vectorized font pdf. It is released under the GPL v3 license. Currently only Linux is supported, but I plan to add support for Windows and Mac in the near future.

Please check out the project homepage at https://natecraun.net/projects/smoothscan/ to download the first release, v0.1.0.
Developers can go to the GitHub project page at https://github.com/ncraun/smoothscan for access to the latest source code.

[Image: sample comparing the original scan to the vectorized version of a zoomed-in 'R' character.]

Also note that this is very new software, so there are probably bugs lurking beneath the surface. The more users we get, the faster the bugs can be uncovered and fixed.

Ask me any questions you have, I'd love to answer them.

Here is a copy of the text from the README file. It will tell you everything you need to know about the project:

smoothscan is a tool to convert scanned text into a vectorized output
form. Because printed text is assembled from fonts, each particular
letter (like 'o') will have the same shape as every other 'o' in the
document. We can take advantage of this, by building a table of such
symbols, and represent each occurrence of a symbol with a reference to
that symbol's table entry. This will save a lot of space, and a
similar idea is used in djvu's jb2 mode and JBIG2 for PDF.
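
To make the table idea concrete, here is a minimal Python sketch. It is an illustration only, not smoothscan's code: the `Glyph` representation (rows of '#' and '.') and the function name `build_symbol_table` are made up for this example, and it only merges pixel-identical glyphs, whereas real scans need a similarity test.

```python
from typing import Dict, List, Tuple

# Hypothetical glyph representation: one string per pixel row,
# '#' for ink and '.' for paper.
Glyph = Tuple[str, ...]


def build_symbol_table(glyphs: List[Glyph]) -> Tuple[List[Glyph], List[int]]:
    """Return (table, refs): the unique shapes, and one table index per input glyph."""
    table: List[Glyph] = []
    index_of: Dict[Glyph, int] = {}
    refs: List[int] = []
    for g in glyphs:
        if g not in index_of:
            # First time this shape is seen: give it a new table entry.
            index_of[g] = len(table)
            table.append(g)
        refs.append(index_of[g])
    return table, refs


O = ("###", "#.#", "###")
L = ("#..", "#..", "###")
page = [L, O, O, L]  # four glyphs on the page, two distinct shapes

table, refs = build_symbol_table(page)
print(len(table), refs)  # 2 [0, 1, 1, 0]
```

The page is stored as four small integers plus two shapes instead of four full bitmaps, which is where the space saving comes from.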

smoothscan builds up this table, but instead of filling the table with
the original raster images, it vectorizes each symbol. Vector images
will look smoother than their raster equivalents, and can be scaled
without introducing pixelation. These properties result in a smaller
output file size, as well as making the scanned text images more
readable.

smoothscan saves the vectorized images into a custom TrueType font and
embeds the font into the output pdf file. Currently each symbol is
mapped to an arbitrary letter in the font, but in future versions you
could run OCR on each symbol, and ensure that the 'o' image is
associated with the 'o' character encoding in the generated font.

To get good results, you must have good input. Higher resolution scans
capture more detail about the shape of each symbol, so a higher
quality vectorized version can be created. It's a good idea to process
your scanned images using a tool like ScanTailor before running
smoothscan.

Currently smoothscan can only process pure black and white 1bpp images,
but in the future support will be added for other formats, especially
ScanTailor's Mixed output mode.

smoothscan is currently targeted at GNU/Linux based systems, but
Windows and OS X will be supported in future versions.

spamsickle · Post by **spamsickle** » 23 Aug 2013, 23:39

I haven't tried your software yet, but I intend to. I have had a little experience with Clearscan, which you seem to be emulating. Clearscan appears to create a custom font which covers the range of pages you're converting (and, at least in my experience, tended to crash if you tried to convert more than about 50-100 pages at a time).

It would create several different vector fonts for each letter, and choose the best fit. I don't recall if there were options which would control how many duplicates might be created (or what I assume would be the same thing, how precisely the vector needed to match the binary character).

I've been thinking for some time that the best way to get smaller file sizes would be to go from OCR to a vector font, whether that was a named font or a custom font. The risk of that approach is that OCR errors could lead to erroneous page images, something which "OCR behind scanned image" avoids even if the OCR is lousy. If the scanned image is preserved, readability is maintained even if bad OCR degrades searchability.

It looks like Zylab may be doing something similar, allowing searches to be created using the custom font, thus avoiding the pitfalls of bad OCR.

I'm looking forward to trying your software in the next couple of weeks, and I'll let you know how well it works.

ncraun · Post by **ncraun** » 24 Aug 2013, 07:05

spamsickle wrote: I haven't tried your software yet, but I intend to. I have had a little experience with Clearscan, which you seem to be emulating. Clearscan appears to create a custom font which covers the range of pages you're converting (and, at least in my experience, tended to crash if you tried to convert more than about 50-100 pages at a time).

In my testing, I have been able to create a 500+ page book without the software crashing. It took about 8 minutes, but I believe I can reduce that time by adding multithreading to the font generation step.

spamsickle wrote: It would create several different vector fonts for each letter, and choose the best fit. I don't recall if there were options which would control how many duplicates might be created (or what I assume would be the same thing, how precisely the vector needed to match the binary character).

I've been thinking for some time that the best way to get smaller file sizes would be to go from OCR to a vector font, whether that was a named font or a custom font. The risk of that approach is that OCR errors could lead to erroneous page images, something which "OCR behind scanned image" avoids even if the OCR is lousy. If the scanned image is preserved, readability is maintained even if bad OCR degrades searchability.

The symbols on the page are mapped to arbitrary font code points; smoothscan doesn't have any OCR support yet. In the future I plan to add OCR support, probably through libtesseract. Right now the character 'o' could be mapped to the character '7', for example, so we don't have support for accurate full text searching in this early version.

Symbols are classified in a way similar to how JBIG2 works. If you aren't familiar with JBIG2 compression, there's an interesting page on it here: http://leptonica.com/jbig2.html. Basically, it breaks the page apart into a number of symbols, and if two symbols are similar enough, it replaces one symbol with the other. These symbols aren't confined to full letters, though; a symbol could be any mark on the page, such as a little logo. You can "control how many duplicates" with the threshold and weight parameters, though the default works pretty well in most cases. A higher threshold results in fewer duplicates, but it could potentially cause similar symbols to replace each other; for example, an 'a' could be replaced by an 'o' if you set the parameter too high. It is better to overclassify and have more duplicates than you need than to underclassify and potentially have a character replaced by a different one.
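
The over/underclassification tradeoff can be seen in a toy Python sketch. This is not leptonica's classifier and not smoothscan's code: `differing_fraction` and `classify` are hypothetical names, the glyphs are assumed to be the same size and already aligned, and `threshold` here is simply a tolerance on the fraction of differing pixels, chosen so that a higher value merges more aggressively, as described above. The real threshold and weight parameters may be defined differently.

```python
from typing import List, Tuple

Glyph = Tuple[str, ...]  # same-sized rows of '#' (ink) and '.' (paper)


def differing_fraction(a: Glyph, b: Glyph) -> float:
    """Fraction of pixels that differ between two same-sized glyphs."""
    total = sum(len(row) for row in a)
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total


def classify(glyphs: List[Glyph], threshold: float) -> Tuple[List[Glyph], List[int]]:
    """Greedy matching: reuse the first template within tolerance, else add a new one."""
    templates: List[Glyph] = []
    refs: List[int] = []
    for g in glyphs:
        for i, t in enumerate(templates):
            if differing_fraction(g, t) <= threshold:
                refs.append(i)  # close enough: replace g with the template
                break
        else:
            templates.append(g)  # nothing close enough: new template
            refs.append(len(templates) - 1)
    return templates, refs


o_clean = (".##.", "#..#", "#..#", ".##.")
o_noisy = ("###.", "#..#", "#..#", ".##.")  # one stray pixel
a_clean = (".##.", "#..#", "#.##", ".###")  # two pixels away from o_clean
glyphs = [o_clean, o_noisy, a_clean]

print(classify(glyphs, 0.0)[1])  # [0, 1, 2]  overclassified: two 'o' templates
print(classify(glyphs, 0.1)[1])  # [0, 0, 1]  the two 'o's merge, 'a' stays separate
print(classify(glyphs, 0.2)[1])  # [0, 0, 0]  underclassified: 'a' replaced by 'o'
```

With the tolerance at 0.0 the only cost is a wasted template, while at 0.2 the 'a' is silently drawn as an 'o', which is why erring toward overclassification is the safer choice.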

However, JBIG2 classification and OCR are two different steps in the process. OCR will (in the future) be performed after the classification step. So even if the OCR makes a mistake, it will only affect which character that glyph is mapped to, not the shape of the glyph itself. The glyph would still have the shape of an 'a', but if the OCR mistook it for an 'o', copying and pasting that 'a' would give you an 'o'.
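
A small Python sketch of why an OCR mistake cannot change what is drawn. Everything here is hypothetical (the `shapes` and label dictionaries, `render`, `extract_text`); smoothscan has no OCR step yet, so this only illustrates the separation between glyph shape and character mapping.

```python
from typing import Dict, List

# Glyph shapes keyed by template id, as produced by classification.
# The strings stand in for the real vector outlines.
shapes: Dict[int, str] = {0: "outline-of-a", 1: "outline-of-o"}

# The page is a sequence of template ids.
page_refs: List[int] = [0, 1, 0]


def render(refs: List[int], shapes: Dict[int, str]) -> List[str]:
    """What gets drawn: depends only on the shapes, never on OCR labels."""
    return [shapes[r] for r in refs]


def extract_text(refs: List[int], labels: Dict[int, str]) -> str:
    """What copy and paste returns: depends only on the OCR labels."""
    return "".join(labels[r] for r in refs)


correct_ocr = {0: "a", 1: "o"}
wrong_ocr = {0: "o", 1: "o"}  # OCR misreads template 0 as 'o'

print(render(page_refs, shapes))             # same outlines either way
print(extract_text(page_refs, correct_ocr))  # aoa
print(extract_text(page_refs, wrong_ocr))    # ooo
```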

There are some tradeoffs to using this process, of course. If you have poor quality scans, the result is not going to be as good as if you had high quality scans. I would keep a backup copy of the original scans in any case, especially as smoothscan is still very new.

I don't think that Zylab thing is quite the same; it appears to be working more with nontextual data.

spamsickle · Post by **spamsickle** » 27 Aug 2013, 15:43

Thanks for that link to Leptonica. It is very interesting, and answered a lot of questions I had. I'd kind of missed the JBIG2 move, and the discussion of it here, so that made me aware of something I should have been paying attention to.

I apologize that I still haven't tried your vectorizing tool. More and more lately I'm not doing much post-processing of my scans, but I do have some books that were created by Scan Tailor some time ago, and I promise I'll get to it this week. It looks like a very useful tool.

Oh, and belatedly, welcome to the forum.

Minimalist · Post by **Minimalist** » 14 Sep 2013, 13:03

Great to have something like ClearScan being added to the open source arsenal.
When complete, it does an amazing job of adding OCR, while reducing file size (a lot!) and improving visual quality (a lot).
ClearScan does not let you tinker with the settings, and to be on the safe side it creates a lot more symbols than needed.
It would be great to have more flexibility with that, or possibly even some supervised classification.
I also wondered if you already average bitpatterns that were classified as representing the same symbol, to obtain super-resolution before vectorizing the result?

Good luck with the work!

ncraun · Post by **ncraun** » 15 Sep 2013, 09:49

Minimalist wrote: It would be great to have more flexibility

smoothscan uses the excellent open source leptonica libraries to help with image classification. Currently users can tweak the threshold and weight parameters to provide more flexibility. A higher threshold value means fewer symbol templates will be produced, but runs the risk of underclassifying (similar symbols getting misrecognized as one another). A lower value takes a more conservative approach, but could result in overclassification (slightly different versions of the same character having multiple font entries). In general it's better to err on the side of overclassifying, as it's better to waste a bit of space on a few letters than to have letters mistaken for one another. I've tried to set the default to a reasonable balance between these extremes; for more information, check out the man page, readme, etc.

Minimalist wrote: or possibly even some supervised classification.

Given the large number of symbols generated in a typical book, I don't know if supervised classification would be worth it. It would take so much time, you might as well just retype the whole book.

Minimalist wrote: I also wondered if you already average bitpatterns that were classified as representing the same symbol, to obtain super-resolution before vectorizing the result?

Interesting idea. Might look more into it in the future, but right now there are some higher priorities for the project.
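
For readers curious what the averaging step might look like, here is a minimal Python sketch. It is not part of smoothscan: `average_glyphs` is a hypothetical helper, and it assumes all samples of a symbol have the same size and are already aligned. It only does a per-pixel majority vote at the original resolution, which removes isolated noise; real super-resolution would additionally need subpixel alignment and upsampling.

```python
from typing import List, Tuple

Glyph = Tuple[str, ...]  # same-sized rows of '#' (ink) and '.' (paper)


def average_glyphs(samples: List[Glyph]) -> Glyph:
    """Per-pixel majority vote over aligned samples of the same symbol."""
    n = len(samples)
    rows, cols = len(samples[0]), len(samples[0][0])
    out = []
    for r in range(rows):
        row = ""
        for c in range(cols):
            ink = sum(s[r][c] == "#" for s in samples)
            # Keep the pixel as ink only if more than half the samples agree.
            row += "#" if 2 * ink > n else "."
        out.append(row)
    return tuple(out)


# Three scans of the same 'o', each with one pixel of noise in a different place.
s1 = ("###.", "#..#", "#..#", ".##.")
s2 = (".##.", "#.##", "#..#", ".##.")
s3 = (".##.", "#..#", "#..#", ".#..")

for row in average_glyphs([s1, s2, s3]):
    print(row)
```

Each noisy pixel appears in only one of the three samples, so the vote recovers the clean shape that would then be passed to the vectorizer.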
