Vectorization

Convert page images into searchable text. Talk about software, techniques, and new developments here.

Moderator: peterZ

scanner
Posts: 9
Joined: 14 Jun 2010, 19:14

Vectorization

Post by scanner »

Hello,

Would vectorizing scanned or photographed text improve OCR results?
Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: Vectorization

Post by Tulon »

It wouldn't. OCR engines work with raster images only, so you would be vectorizing your text first and then rasterizing it again.
Scan Tailor experimental doesn't output 96 DPI images. It's just what your software shows when DPI information is missing. Usually what you get is input DPI times the resolution enhancement factor.
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Vectorization

Post by spamsickle »

You can vectorize your PDF at the same time you OCR it with Adobe Acrobat 9's "ClearScan" option. I never really cared much about OCR, but I've started doing it just because it's included in the vectorization AA9 performs. The big advantage, for me, is that the vectorized PDF typically takes up about one-tenth of the hard drive space that the raster PDF took.
Shaknum
Posts: 91
Joined: 16 Aug 2010, 13:10

Re: Vectorization

Post by Shaknum »

I'd have to agree for the most part with the assessments above. Remember, though, that vectorizing an image significantly smooths out the characters, so it's possible that vectorizing and then converting back to a 1-bit bitonal image might help to some small degree. In my test I don't see a difference. Here is the data:

Raster image:
Test-Raster.tif (346.25 KiB)
OCR output:
Test - OCR - Raster.rtf (5.1 KiB)
The vectorized image is too big for the site; here is its OCR output:
Test - OCR - Vector.rtf (5.1 KiB)
The vectorized bitonal image:
Test-Vector-BiTonal.tif (336.53 KiB)
The vectorized bitonal OCR output:
Test - OCR - Vector-BiTonal.rtf (5.22 KiB)
Shaknum
Posts: 91
Joined: 16 Aug 2010, 13:10

Re: Vectorization

Post by Shaknum »

Also, I have to second the vote for ClearScan. It makes the text really clean and saves a lot of space. It would be great if you could correct OCR errors, however, and that does not seem to be possible with the ClearScan option (no training is possible either).
StevePoling
Posts: 290
Joined: 20 Jun 2009, 12:19
E-book readers owned: SONY PRS-505, Kindle DX
Number of books owned: 9999
Location: Grand Rapids, MI

Re: Vectorization

Post by StevePoling »

OCR is a pattern recognition problem. As such, it works in two parts: feature extraction and pattern classification. These two parts are tuned to work with one another. Feature extraction takes the raster and transforms it into a more compact representation. This more compact representation (hopefully) enhances the meaningful parts of the raster and suppresses the unmeaningful parts, making the pattern classification job easier. The problem with vectorization is that it can fight with feature extraction. In a sense, vectorization is another form of feature extraction, but it is set up for another purpose.
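
The two-stage split described above can be sketched with a toy example. Everything in it is an assumption made for illustration: the 4x4 glyph bitmaps, the zone-density feature, and the names `extract_features`, `classify`, and `TEMPLATES` are invented here and do not reflect how any real OCR engine is built.

```python
from typing import Dict, List, Tuple

Bitmap = List[str]  # rows of '#' (ink) and '.' (background)


def extract_features(bitmap: Bitmap, grid: int = 2) -> Tuple[float, ...]:
    """Feature extraction: reduce the raster to grid*grid ink densities."""
    height, width = len(bitmap), len(bitmap[0])
    features = []
    for gy in range(grid):
        for gx in range(grid):
            rows = range(gy * height // grid, (gy + 1) * height // grid)
            cols = range(gx * width // grid, (gx + 1) * width // grid)
            ink = sum(bitmap[y][x] == "#" for y in rows for x in cols)
            features.append(ink / (len(rows) * len(cols)))
    return tuple(features)


def classify(features: Tuple[float, ...],
             templates: Dict[str, Tuple[float, ...]]) -> str:
    """Pattern classification: pick the template with the nearest features."""
    def distance(label: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(features, templates[label]))
    return min(templates, key=distance)


# Hypothetical 4x4 reference glyphs, invented for this sketch.
GLYPH_L = ["#...", "#...", "#...", "####"]
GLYPH_T = ["####", ".##.", ".##.", ".##."]
TEMPLATES = {"L": extract_features(GLYPH_L), "T": extract_features(GLYPH_T)}

# A noisy scan of "L": one stray pixel added, one corner pixel lost.
NOISY_L = ["#...", "#...", "#.#.", "###."]
RESULT = classify(extract_features(NOISY_L), TEMPLATES)
print(RESULT)
```

In this toy case the noisy glyph happens to produce the same zone densities as the clean "L", which is the sense in which a compact representation can suppress the unmeaningful parts of the raster. A classifier tuned to these features would gain nothing from a smoother outline.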

I believe vectorization can be useful in a different context, however. The Kindle's TOPAZ format is a vectorization of scanned imagery. If your reason for doing book scanning is to read on your Kindle, then all you have to do is take your vectorization and convert it to the TOPAZ format. Good luck doing that since the format is Amazon proprietary. You'll need a fair bit of reverse-engineering smarts.

Does anybody know whether PDF handles vector formats like, say, SVG?
dingodog
Posts: 110
Joined: 22 Jul 2010, 18:19
Number of books owned: 1000
Country: on the net
Location: on the net

Re: Vectorization

Post by dingodog »

PDF is a container format with its own vector graphics, so SVG artwork can be stored in it once converted.

SVG can be converted to PDF with:

*uniconvertor* (needs Python)
- http://sk1project.org/modules.php?name= ... iconvertor

or *Prince XML* (free for personal use)
- http://princexml.com/
dingodog
Posts: 110
Joined: 22 Jul 2010, 18:19
Number of books owned: 1000
Country: on the net
Location: on the net

Re: Vectorization

Post by dingodog »

For black-and-white scans of books,

*potrace*
- http://potrace.sourceforge.net/

is the best.

It can vectorize and produce PDFs directly, using the -b pdf backend. Here is its help output:

Code:

potrace 1.8. Transforms bitmaps into vector graphics.

Usage: potrace [options] [file...]
General options:
 -h, --help                 - print this help message and exit
 -v, --version              - print version info and exit
 -l, --license              - print license info and exit
 -V, --show-defaults        - print compiled-in defaults and exit
 --progress                 - show progress bar
Input/output options:
 -o, --output <file>        - output to file
Backend selection:
 -e, --eps                  - EPS backend (encapsulated postscript) (default)
 -p, --postscript           - Postscript backend
 -s, --svg                  - SVG backend (scalable vector graphics)
 -g, --pgm                  - PGM backend (portable greymap)
 -b, --backend <name>       - select backend by name
Algorithm options:
 -z, --turnpolicy <policy>  - how to resolve ambiguities in path decomposition
 -t, --turdsize <n>         - suppress speckles of up to this size (default 2)
 -a, --alphamax <n>         - corner threshold parameter (default 1)
 -n, --longcurve            - turn off curve optimization
 -O, --opttolerance <n>     - curve optimization tolerance (default 0.2)
 -u, --unit <n>             - quantize output to 1/unit pixels (default 10)
 -d, --debug <n>            - produce debugging output of type n (n=1,2,3)
Scaling and placement options:
 -W, --width <dim>          - width of output image
 -H, --height <dim>         - height of output image
 -r, --resolution <n>[x<n>] - resolution (in dpi)
 -x, --scale <n>[x<n>]      - scaling factor (pgm backend)
 -S, --stretch <n>          - yresolution/xresolution
 -A, --rotate <angle>       - rotate counterclockwise by angle
 -M, --margin <dim>         - margin
 -L, --leftmargin <dim>     - left margin
 -R, --rightmargin <dim>    - right margin
 -T, --topmargin <dim>      - top margin
 -B, --bottommargin <dim>   - bottom margin
Output options, supported by some backends:
 -C, --color #rrggbb        - set line color (default black)
 --fillcolor #rrggbb        - set fill color (default transparent)
 --opaque                   - make white shapes opaque
 --group                    - group related paths together
Postscript/EPS options:
 -P, --pagesize <format>    - page size (default is letter)
 -c, --cleartext            - do not compress the output
 -2, --level2               - use postscript level 2 compression (default)
 -3, --level3               - use postscript level 3 compression
 -q, --longcoding           - do not optimize for file size
PGM options:
 -G, --gamma <n>            - gamma value for anti-aliasing (default 2.2)
Frontend options:
 -k, --blacklevel <n>       - black/white cutoff in input file (default 0.5)
 -i, --invert               - invert bitmap

Dimensions can have optional units, e.g. 6.5in, 15cm, 100pt.
Default is inches (or pixels for pgm and gimppath backends).
Possible input file formats are: pnm (pbm, pgm, ppm), bmp.
Backends are: eps, postscript, ps, pdf, svg, pgm, gimppath, xfig.
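
As a rough sketch of how the options listed above might be combined for a book page, the Python helper below assembles a potrace call that selects the pdf backend. It uses only flags that appear in the help text (`-b`, `-t`, `-r`, `-o`). The function names, the file names, and the chosen values (a speckle size of 4, 300 dpi) are assumptions for illustration, not recommended settings.

```python
import shutil
import subprocess
from typing import List, Optional


def build_potrace_command(input_path: str, output_path: str,
                          turdsize: int = 2,
                          resolution_dpi: Optional[int] = None) -> List[str]:
    """Assemble a potrace call that writes PDF via the pdf backend."""
    command = ["potrace", "-b", "pdf", "-t", str(turdsize), "-o", output_path]
    if resolution_dpi is not None:
        command += ["-r", str(resolution_dpi)]
    command.append(input_path)  # input must be pnm (pbm, pgm, ppm) or bmp
    return command


def vectorize_page(input_path: str, output_path: str, **options) -> bool:
    """Run potrace if it is installed; return False when it is not on PATH."""
    if shutil.which("potrace") is None:
        return False
    subprocess.run(build_potrace_command(input_path, output_path, **options),
                   check=True)
    return True


# Hypothetical file names; only the command line is built here.
COMMAND = build_potrace_command("page-001.pbm", "page-001.pdf",
                                turdsize=4, resolution_dpi=300)
print(" ".join(COMMAND))
```

Because potrace reads only pnm and bmp input, scans saved as TIFF or PNG would need converting first.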
Gerard
Posts: 154
Joined: 17 Oct 2010, 07:15
Number of books owned: 0
Location: Berlin (Germany)

Re: Vectorization

Post by Gerard »

potrace works quite nicely.
Input:
input.png (3 KiB)
Output (default settings):
potrace output.png (10.81 KiB)
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Vectorization

Post by spamsickle »

dingodog wrote:
for Black and white scans of books, *potrace* (http://potrace.sourceforge.net/) is the best. it can vectorize and produce directly pdfs, using -b pdf backend

I've run into several books which Adobe's ClearScan chokes on, so I gave potrace a try. While it does seem to be able to vectorize pages which cause Adobe's OCR/Vectorization to throw up its hands, it's not giving me the compression I got from ClearScan. In fact, sometimes potrace even doubles the size of my bitmapped PDF page.

I haven't tried doing anything fancy, like separating text from graphics before vectorizing, so solutions may be out there.
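
Since the vectorized page can come out larger than the raster one, a per-page size check is one simple way to decide which version to keep. The sketch below is only an illustration: the helper names `size_ratio` and `pick_smaller` are made up, and the demo writes stand-in files of 1000 and 2000 bytes rather than real PDFs.

```python
import os
import tempfile


def size_ratio(raster_path: str, vector_path: str) -> float:
    """Vector size divided by raster size; below 1.0 means vector is smaller."""
    return os.path.getsize(vector_path) / os.path.getsize(raster_path)


def pick_smaller(raster_path: str, vector_path: str) -> str:
    """Return the path of whichever version takes less disk space."""
    if size_ratio(raster_path, vector_path) < 1.0:
        return vector_path
    return raster_path


# Demo with stand-in files: the "vector" page is twice the raster size.
with tempfile.TemporaryDirectory() as workdir:
    raster = os.path.join(workdir, "page-raster.pdf")
    vector = os.path.join(workdir, "page-vector.pdf")
    with open(raster, "wb") as handle:
        handle.write(b"\0" * 1000)
    with open(vector, "wb") as handle:
        handle.write(b"\0" * 2000)
    RATIO = size_ratio(raster, vector)
    KEPT = os.path.basename(pick_smaller(raster, vector))

print(RATIO, KEPT)
```

A ratio near 0.1 would correspond to the kind of saving reported for ClearScan earlier in the thread, while a ratio of 2.0 matches the doubling described here.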