Vectorization

scanner · Post by **scanner** » 28 Oct 2010, 06:08

Hello,

Would vectorization of scanned or photographed text improve OCR results?

Tulon · Post by **Tulon** » 28 Oct 2010, 10:22

It wouldn't. OCR engines work with raster images only, so you would be vectorizing your text first and then rasterizing it again.

spamsickle · Post by **spamsickle** » 28 Oct 2010, 13:05

You can vectorize your PDF at the same time you OCR it with Adobe Acrobat 9's "ClearScan" option. I never really cared much about OCR, but I've started doing it just because it's included in the vectorization that Acrobat 9 performs. The big advantage, for me, is that the vectorized PDF typically takes up 1/10th of the hard drive space that the raster PDF took.

Shaknum · Post by **Shaknum** » 30 Oct 2010, 11:34

I'd have to agree for the most part with the assessments above. Remember, though, that vectorizing an image significantly smooths out the characters, so it's possible that vectorizing and then converting back to a 1-bit bitonal image might help to some small degree. In my test I don't see a difference. Here is the data:

Raster Image:

Test-Raster.tif (346.25 KiB)

OCR Output:

Test - OCR - Raster.rtf (5.1 KiB)

The vectorized image is too big for the site. Here is its OCR output:

Test - OCR - Vector.rtf (5.1 KiB)

The Vectorized Bi-Tonal:

Test-Vector-BiTonal.tif (336.53 KiB)

The Vectorized Bi-Tonal OCR:

Test - OCR - Vector-BiTonal.rtf (5.22 KiB)
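
One way to put a number on "I don't see a difference" is to compare the OCR outputs as plain text. The sketch below is only an illustration. It assumes the .rtf files have already been exported to plain text, the strings are made-up placeholders rather than the contents of the attachments, and the function name `similarity` is hypothetical. It uses `difflib` from the Python standard library.

```python
import difflib


def similarity(text_a: str, text_b: str) -> float:
    """Return a 0..1 similarity score between two OCR text outputs."""
    # Compare word by word so that differences in line wrapping
    # are not counted as OCR errors.
    words_a = text_a.split()
    words_b = text_b.split()
    return difflib.SequenceMatcher(None, words_a, words_b).ratio()


# Placeholder strings standing in for exported OCR text (hypothetical data).
raster_text = "The quick brown fox jumps over the lazy dog"
vector_text = "The quick brown fox jumps over the lazy dog"
bitonal_text = "The quick brovvn fox jumps over the lazy dog"

print(similarity(raster_text, vector_text))   # identical text scores 1.0
print(similarity(raster_text, bitonal_text))  # one misread word scores below 1.0
```

A score of 1.0 for every pair would back up the "no difference" observation. Anything lower points at the words worth checking by eye.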

Shaknum · Post by **Shaknum** » 30 Oct 2010, 11:36

Also, I have to second the vote for ClearScan. It makes the text really clean and saves a lot of space. It would be great if you could correct OCR errors, however, and that does not seem to be possible with the ClearScan option (no training is possible either).

StevePoling · Post by **StevePoling** » 06 Nov 2010, 15:15

OCR is a pattern recognition problem. As such, it works in two parts: feature extraction and pattern classification. These two parts are tuned to work with one another. Feature extraction takes the raster and transforms it into a more compact representation. This more compact representation (hopefully) enhances the meaningful parts of the raster and suppresses the unmeaningful parts, making the pattern classification job easier. The problem with vectorization is that it can fight with feature extraction. In a sense, vectorization is another form of feature extraction, but it is set up for another purpose.
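
To make the two-stage picture concrete, here is a toy sketch. It is not how any real OCR engine works internally. Feature extraction reduces a small 1-bit glyph raster to per-zone ink densities, and classification picks the nearest stored template. The glyph bitmaps, the 2x2 zoning, and the function names are all illustrative assumptions.

```python
from typing import Dict, List

Bitmap = List[str]  # rows of '#' (ink) and '.' (background)


def extract_features(glyph: Bitmap, zones: int = 2) -> List[float]:
    """Feature extraction: ink density in each cell of a zones x zones grid."""
    height, width = len(glyph), len(glyph[0])
    features = []
    for zy in range(zones):
        for zx in range(zones):
            rows = range(zy * height // zones, (zy + 1) * height // zones)
            cols = range(zx * width // zones, (zx + 1) * width // zones)
            ink = sum(glyph[y][x] == "#" for y in rows for x in cols)
            features.append(ink / (len(rows) * len(cols)))
    return features


def classify(features: List[float], templates: Dict[str, List[float]]) -> str:
    """Pattern classification: nearest template by squared Euclidean distance."""
    def distance(label: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(features, templates[label]))
    return min(templates, key=distance)


# Hypothetical 4x4 training glyphs.
GLYPH_L = ["#...", "#...", "#...", "####"]
GLYPH_T = ["####", ".#..", ".#..", ".#.."]
templates = {"L": extract_features(GLYPH_L), "T": extract_features(GLYPH_T)}

# A noisy 'L' with one stray pixel is still closest to the 'L' template.
noisy_l = ["#...", "#..#", "#...", "####"]
print(classify(extract_features(noisy_l), templates))
```

The classifier only ever sees the four density numbers, never the pixels. That is why the two stages have to be tuned together, and why changing the input ahead of them (for example by vectorizing and re-rasterizing) can help or hurt in ways that are hard to predict.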

I believe vectorization can be useful in a different context, however. The Kindle's TOPAZ format is a vectorization of scanned imagery. If your reason for doing book scanning is to read on your Kindle, then all you have to do is take your vectorization and convert it to the TOPAZ format. Good luck doing that since the format is Amazon proprietary. You'll need a fair bit of reverse-engineering smarts.

Does anybody know whether PDF handles vector formats like, say, SVG?

dingodog · Post by **dingodog** » 06 Nov 2010, 16:39

PDF can hold vector graphics, so an SVG can be converted to PDF with

*uniconvertor* (needs Python)
- http://sk1project.org/modules.php?name= ... iconvertor

or *Prince XML* (free for personal use)
- http://princexml.com/

dingodog · Post by **dingodog** » 07 Nov 2010, 12:50

For black-and-white scans of books,

*potrace*
- http://potrace.sourceforge.net/

is the best.

It can vectorize and produce PDFs directly, using the -b pdf backend option.

Code:

potrace 1.8. Transforms bitmaps into vector graphics.

Usage: potrace [options] [file...]
General options:
 -h, --help                 - print this help message and exit
 -v, --version              - print version info and exit
 -l, --license              - print license info and exit
 -V, --show-defaults        - print compiled-in defaults and exit
 --progress                 - show progress bar
Input/output options:
 -o, --output <file>        - output to file
Backend selection:
 -e, --eps                  - EPS backend (encapsulated postscript) (default)
 -p, --postscript           - Postscript backend
 -s, --svg                  - SVG backend (scalable vector graphics)
 -g, --pgm                  - PGM backend (portable greymap)
 -b, --backend <name>       - select backend by name
Algorithm options:
 -z, --turnpolicy <policy>  - how to resolve ambiguities in path decomposition
 -t, --turdsize <n>         - suppress speckles of up to this size (default 2)
 -a, --alphamax <n>         - corner threshold parameter (default 1)
 -n, --longcurve            - turn off curve optimization
 -O, --opttolerance <n>     - curve optimization tolerance (default 0.2)
 -u, --unit <n>             - quantize output to 1/unit pixels (default 10)
 -d, --debug <n>            - produce debugging output of type n (n=1,2,3)
Scaling and placement options:
 -W, --width <dim>          - width of output image
 -H, --height <dim>         - height of output image
 -r, --resolution <n>[x<n>] - resolution (in dpi)
 -x, --scale <n>[x<n>]      - scaling factor (pgm backend)
 -S, --stretch <n>          - yresolution/xresolution
 -A, --rotate <angle>       - rotate counterclockwise by angle
 -M, --margin <dim>         - margin
 -L, --leftmargin <dim>     - left margin
 -R, --rightmargin <dim>    - right margin
 -T, --topmargin <dim>      - top margin
 -B, --bottommargin <dim>   - bottom margin
Output options, supported by some backends:
 -C, --color #rrggbb        - set line color (default black)
 --fillcolor #rrggbb        - set fill color (default transparent)
 --opaque                   - make white shapes opaque
 --group                    - group related paths together
Postscript/EPS options:
 -P, --pagesize <format>    - page size (default is letter)
 -c, --cleartext            - do not compress the output
 -2, --level2               - use postscript level 2 compression (default)
 -3, --level3               - use postscript level 3 compression
 -q, --longcoding           - do not optimize for file size
PGM options:
 -G, --gamma <n>            - gamma value for anti-aliasing (default 2.2)
Frontend options:
 -k, --blacklevel <n>       - black/white cutoff in input file (default 0.5)
 -i, --invert               - invert bitmap

Dimensions can have optional units, e.g. 6.5in, 15cm, 100pt.
Default is inches (or pixels for pgm and gimppath backends).
Possible input file formats are: pnm (pbm, pgm, ppm), bmp.
Backends are: eps, postscript, ps, pdf, svg, pgm, gimppath, xfig.
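
Going only by the help text above, typical invocations can be sketched as follows. The helper names and file names are hypothetical. The flags (-b, -o, -t, -g, -x) are taken from the listing. The script only builds the command lines and does not run them, so it does not assume potrace is installed. Note that the listed input formats are pnm and bmp, so a PNG or TIFF scan would need converting first.

```python
from pathlib import Path
from typing import List

# Input formats listed in the potrace 1.8 help text above.
SUPPORTED_INPUT = {".pnm", ".pbm", ".pgm", ".ppm", ".bmp"}


def _check_input(bitmap: str) -> None:
    """Reject files that potrace 1.8 does not list as readable."""
    if Path(bitmap).suffix.lower() not in SUPPORTED_INPUT:
        raise ValueError(f"{bitmap}: convert to pnm or bmp before running potrace")


def potrace_pdf_command(bitmap: str, turdsize: int = 2) -> List[str]:
    """Build a command that vectorizes one page straight to PDF."""
    _check_input(bitmap)
    output = str(Path(bitmap).with_suffix(".pdf"))
    return ["potrace", "-b", "pdf", "-t", str(turdsize), "-o", output, bitmap]


def potrace_roundtrip_command(bitmap: str, scale: int = 1) -> List[str]:
    """Build a command that vectorizes and renders back to a PGM raster."""
    _check_input(bitmap)
    source = Path(bitmap)
    output = str(source.with_name(source.stem + "-traced.pgm"))
    return ["potrace", "-g", "-x", str(scale), "-o", output, bitmap]


print(" ".join(potrace_pdf_command("page001.pbm")))
print(" ".join(potrace_roundtrip_command("page001.pbm", scale=2)))
# To actually run one (requires potrace on PATH):
#   import subprocess
#   subprocess.run(potrace_pdf_command("page001.pbm"), check=True)
```

The second command uses the PGM backend, which gives the vectorize-then-rasterize round trip discussed earlier in the thread without a separate rendering tool. Its output is a raster again, which is what an OCR engine would consume.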

Gerard · Post by **Gerard** » 07 Nov 2010, 13:08

potrace works quite nicely.

Input:

input.png (3 KiB)

Output (default settings):

potrace output.png (10.81 KiB)

spamsickle · Post by **spamsickle** » 07 Nov 2010, 14:49

dingodog wrote: for Black and white scans of books,

*potrace*
- http://potrace.sourceforge.net/

is the best

it can vectorize and produce directly pdfs, using -b pdf backend

I've run into several books which Adobe's ClearScan chokes on, so I gave potrace a try. While it does seem to be able to vectorize pages which cause Adobe's OCR/Vectorization to throw up its hands, it's not giving me the compression I got from ClearScan. In fact, sometimes potrace even doubles the size of my bitmapped PDF page.

I haven't tried doing anything fancy, like separating text from graphics before vectorizing, so solutions may be out there.
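
The size comparisons in this thread (roughly 1/10th with ClearScan, sometimes double with potrace) are easy to check page by page. A minimal sketch, assuming you have a raster PDF and a vectorized PDF of the same page on disk. The function name and the file names in the usage comment are made up for illustration.

```python
import os


def size_ratio(raster_path: str, vector_path: str) -> float:
    """Return vector size divided by raster size.

    A value below 1.0 means the vectorized file is smaller;
    2.0 would mean it doubled.
    """
    raster_bytes = os.path.getsize(raster_path)
    vector_bytes = os.path.getsize(vector_path)
    if raster_bytes == 0:
        raise ValueError(f"{raster_path} is empty")
    return vector_bytes / raster_bytes


# Example usage with hypothetical file names:
#   ratio = size_ratio("page001-raster.pdf", "page001-vector.pdf")
#   print(f"vector/raster = {ratio:.2f}")
```

Running this over a whole book would show which pages grow when vectorized, which may help decide whether separating text from graphics first is worth trying.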

DIY Book Scanner
