Vectorization
Hello,
Would vectorization of scanned/photographed text improve OCR results?
Re: Vectorization
It wouldn't. OCR engines work with raster images only, so you would be vectorizing your text first and then rasterizing it again.
Scan Tailor experimental doesn't output 96 DPI images. It's just what your software shows when DPI information is missing. Usually what you get is input DPI times the resolution enhancement factor.
Re: Vectorization
You can vectorize your PDF at the same time you OCR it with Adobe Acrobat 9's "ClearScan" option. I never really cared much about OCR, but I've started doing it just because it's included in the vectorization AA9 performs. The big advantage, for me, is that the vectorized PDF typically takes up 1/10th of the space on my hard drive that the raster PDF took.
Re: Vectorization
I'd have to agree for the most part with the assessments above. Remember, though, that vectorizing an image significantly smooths out the characters, so it's possible that vectorizing and then converting back to a 1-bit bitonal image might help to some small degree. In practice I don't see a difference; here is the data:
[Attachments not preserved: the raster image and its OCR output; the OCR output of the vectorized image (the vectorized image itself was too big for the site); the vectorized bi-tonal image and its OCR output.]
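The "convert back to a 1-bit bitonal image" step can be sketched as a plain threshold. This is only an illustration under assumptions: the vectorized page is taken to be already rendered to a grayscale grid (0 = black, 255 = white), and `to_bitonal` and its `cutoff` parameter are hypothetical names, not part of any tool mentioned in this thread.

```python
def to_bitonal(gray_rows, cutoff=0.5):
    """Threshold a grayscale page (0 = black, 255 = white) to 1-bit.

    Returns rows of 1 (ink) and 0 (paper). A pixel counts as ink when its
    brightness, scaled to 0..1, falls below `cutoff`.
    """
    return [[1 if value / 255.0 < cutoff else 0 for value in row]
            for row in gray_rows]


# A 3x4 patch with soft, anti-aliased edges, roughly what the edge of a
# rendered vector glyph might look like before thresholding.
patch = [
    [255, 200,  90,   0],
    [255, 130,  40,   0],
    [255, 240, 128, 100],
]
bitonal = to_bitonal(patch)
```

After this step the smooth outline is a hard-edged bitmap again, which is consistent with the OCR output barely changing.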
Re: Vectorization
Also, I have to second the vote for ClearScan. It makes the text really clean and saves a lot of space. It would be great if you could correct OCR errors, however, which does not seem to be possible with the ClearScan option (no training is possible either).
Re: Vectorization
OCR is a pattern recognition problem. As such, it works in two parts: feature extraction and pattern classification. These two parts are tuned to work with one another. Feature extraction takes the raster and transforms it into a more compact representation. This more compact representation (hopefully) enhances the meaningful parts of the raster and suppresses the unmeaningful parts, making the pattern classification job easier. The problem with vectorization is that it can fight with feature extraction. In a sense, vectorization is another form of feature extraction, but it is set up for another purpose.
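To make the two stages concrete, here is a toy sketch. It assumes zoning (ink density per grid cell) as the feature extractor and nearest-template matching as the classifier; both are picked for illustration only and are not what any particular OCR engine does. The glyph shapes and all function names are made up.

```python
def zone_features(glyph, zones=2):
    """Feature extraction: fraction of ink pixels in each cell of a zones x zones grid."""
    height, width = len(glyph), len(glyph[0])
    features = []
    for zy in range(zones):
        for zx in range(zones):
            y0, y1 = zy * height // zones, (zy + 1) * height // zones
            x0, x1 = zx * width // zones, (zx + 1) * width // zones
            cells = [glyph[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(cells) / len(cells))
    return features


def classify(features, templates):
    """Pattern classification: label of the nearest template (squared Euclidean distance)."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, templates[label]))
    return min(templates, key=distance)


def parse(rows):
    """Turn '#'/'.' strings into a 0/1 raster."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


# Hypothetical 4x4 training glyphs.
GLYPHS = {
    "L": parse(["#...", "#...", "#...", "####"]),
    "T": parse(["####", ".##.", ".##.", ".##."]),
    "O": parse(["####", "#..#", "#..#", "####"]),
}
TEMPLATES = {label: zone_features(glyph) for label, glyph in GLYPHS.items()}

# A distorted "L", standing in for a scan whose pixels differ from the training data.
noisy_l = parse(["#...", "#...", "##..", "###."])
guess = classify(zone_features(noisy_l), TEMPLATES)
```

The classifier only ever sees the feature vector, never the raster. Anything that reshapes the pixels before extraction, vectorization included, shifts those features away from what the classifier was tuned on.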
I believe vectorization can be useful in a different context, however. The Kindle's TOPAZ format is a vectorization of scanned imagery. If your reason for doing book scanning is to read on your Kindle, then all you have to do is take your vectorization and convert it to the TOPAZ format. Good luck doing that since the format is Amazon proprietary. You'll need a fair bit of reverse-engineering smarts.
Does anybody know whether PDF handles vector formats like, say, SVG?
- dingodog
Re: Vectorization
PDF is a container format that can hold vector graphics, so SVG can be converted to PDF with
*uniconvertor* (needs Python)
- http://sk1project.org/modules.php?name= ... iconvertor
or *Prince XML* (free for personal use)
- http://princexml.com/
- dingodog
Re: Vectorization
For black and white scans of books,
*potrace*
- http://potrace.sourceforge.net/
is the best.
It can vectorize and produce PDFs directly, using the -b pdf backend.
Code:
potrace 1.8. Transforms bitmaps into vector graphics.
Usage: potrace [options] [file...]
General options:
-h, --help - print this help message and exit
-v, --version - print version info and exit
-l, --license - print license info and exit
-V, --show-defaults - print compiled-in defaults and exit
--progress - show progress bar
Input/output options:
-o, --output <file> - output to file
Backend selection:
-e, --eps - EPS backend (encapsulated postscript) (default)
-p, --postscript - Postscript backend
-s, --svg - SVG backend (scalable vector graphics)
-g, --pgm - PGM backend (portable greymap)
-b, --backend <name> - select backend by name
Algorithm options:
-z, --turnpolicy <policy> - how to resolve ambiguities in path decomposition
-t, --turdsize <n> - suppress speckles of up to this size (default 2)
-a, --alphamax <n> - corner threshold parameter (default 1)
-n, --longcurve - turn off curve optimization
-O, --opttolerance <n> - curve optimization tolerance (default 0.2)
-u, --unit <n> - quantize output to 1/unit pixels (default 10)
-d, --debug <n> - produce debugging output of type n (n=1,2,3)
Scaling and placement options:
-W, --width <dim> - width of output image
-H, --height <dim> - height of output image
-r, --resolution <n>[x<n>] - resolution (in dpi)
-x, --scale <n>[x<n>] - scaling factor (pgm backend)
-S, --stretch <n> - yresolution/xresolution
-A, --rotate <angle> - rotate counterclockwise by angle
-M, --margin <dim> - margin
-L, --leftmargin <dim> - left margin
-R, --rightmargin <dim> - right margin
-T, --topmargin <dim> - top margin
-B, --bottommargin <dim> - bottom margin
Output options, supported by some backends:
-C, --color #rrggbb - set line color (default black)
--fillcolor #rrggbb - set fill color (default transparent)
--opaque - make white shapes opaque
--group - group related paths together
Postscript/EPS options:
-P, --pagesize <format> - page size (default is letter)
-c, --cleartext - do not compress the output
-2, --level2 - use postscript level 2 compression (default)
-3, --level3 - use postscript level 3 compression
-q, --longcoding - do not optimize for file size
PGM options:
-G, --gamma <n> - gamma value for anti-aliasing (default 2.2)
Frontend options:
-k, --blacklevel <n> - black/white cutoff in input file (default 0.5)
-i, --invert - invert bitmap
Dimensions can have optional units, e.g. 6.5in, 15cm, 100pt.
Default is inches (or pixels for pgm and gimppath backends).
Possible input file formats are: pnm (pbm, pgm, ppm), bmp.
Backends are: eps, postscript, ps, pdf, svg, pgm, gimppath, xfig.
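For batch work, the options listed above can be wrapped in a small script. The following is a minimal sketch, assuming potrace is installed and on the PATH. It uses only flags that appear in the help text (`-b`, `-t`, `-r`, `-o`); the function names and the file names `page001.pbm` / `page001.pdf` are hypothetical.

```python
import subprocess


def potrace_pdf_command(bitmap_path, pdf_path, turdsize=2, resolution=None):
    """Build the argument list for tracing one bitmap page into a PDF."""
    command = ["potrace", "-b", "pdf", "-t", str(turdsize)]
    if resolution is not None:
        # Scan resolution in dpi, so the page keeps its physical size.
        command += ["-r", str(resolution)]
    command += ["-o", pdf_path, bitmap_path]
    return command


def trace_page(bitmap_path, pdf_path, **options):
    """Run potrace on one page; raises CalledProcessError if potrace fails."""
    subprocess.run(potrace_pdf_command(bitmap_path, pdf_path, **options), check=True)


# Building the command does not require potrace to be installed.
example = potrace_pdf_command("page001.pbm", "page001.pdf", resolution=600)
```

Calling `trace_page("page001.pbm", "page001.pdf", resolution=600)` would then run the trace. The input has to be one of the formats the help text lists (pnm or bmp).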
Re: Vectorization
potrace works quite nicely.
[Images not preserved: input; output (default settings)]
Re: Vectorization
dingodog wrote: for Black and white scans of books,
*potrace*
- http://potrace.sourceforge.net/
is the best
it can vectorize and produce directly pdfs, using -b pdf backend
I've run into several books which Adobe's ClearScan chokes on, so I gave potrace a try. While it does seem to be able to vectorize pages which cause Adobe's OCR/Vectorization to throw up its hands, it's not giving me the compression I got from ClearScan. In fact, sometimes potrace even doubles the size of my bitmapped PDF page.
I haven't tried doing anything fancy, like separating text from graphics before vectorizing, so solutions may be out there.
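Since the vector version is sometimes larger than the raster one, a per-page size check is one simple way to keep whichever is smaller. This is a sketch only: the helper names are made up, and the demo uses dummy stand-in files rather than real PDFs.

```python
import os
import tempfile


def size_ratio(vector_path, raster_path):
    """Vector file size divided by raster file size (below 1.0 means vector is smaller)."""
    return os.path.getsize(vector_path) / os.path.getsize(raster_path)


def pick_smaller(vector_path, raster_path):
    """Return the path of whichever version of the page takes less space."""
    return vector_path if size_ratio(vector_path, raster_path) < 1.0 else raster_path


# Demo with stand-in files: the "vector" page is twice the size of the raster one.
with tempfile.TemporaryDirectory() as folder:
    raster = os.path.join(folder, "page.raster.pdf")
    vector = os.path.join(folder, "page.vector.pdf")
    with open(raster, "wb") as handle:
        handle.write(b"\0" * 1000)
    with open(vector, "wb") as handle:
        handle.write(b"\0" * 2000)
    demo_ratio = size_ratio(vector, raster)
    demo_choice = os.path.basename(pick_smaller(vector, raster))
```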