Binarization consists of taking a grayscale image and converting it to a black and white image. This is a useful processing step for later OCR conversion to raw text or as a way of reducing image size.
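As a minimal sketch of the idea, here is plain global thresholding with Otsu's classic method, written in pure NumPy against a synthetic two-tone "page" (a real scan would of course be loaded from a file instead):

```python
import numpy as np

def otsu_threshold(gray):
    """Pick the threshold that maximizes between-class variance
    (Otsu's classic global method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum = np.cumsum(hist)
    cum_mean = np.cumsum(hist * np.arange(256))
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = cum[t - 1], total - cum[t - 1]
        if w0 == 0 or w1 == 0:
            continue
        m0 = cum_mean[t - 1] / w0
        m1 = (cum_mean[255] - cum_mean[t - 1]) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Synthetic "page": dark text (value 30) on lighter paper (value 220).
page = np.full((64, 64), 220, dtype=np.uint8)
page[20:40, 10:50] = 30
t = otsu_threshold(page)
bw = np.where(page > t, 255, 0).astype(np.uint8)  # 255 = paper, 0 = ink
```

A single global threshold like this is exactly what the fancier techniques below improve on when the paper tone or lighting varies across the page.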
Here are some notes and links on various techniques for binarization:
There has been a lot of academic research on binarization. In 2009, a paper was published comparing many of these techniques against one another specifically for binarization of text: http://www.cvc.uab.es/icdar2009/papers/3725b375.pdf
If you find any other new techniques that are worth investigating, especially if they have source code available and would be easy to compare with existing methods, please post them here.
A very good and recent publication: A Robust Document Image Binarization Technique for Degraded Document Images.
I've implemented it in C++ with OpenCV on Windows 7. It works very well for degraded documents, but it takes about 30 seconds per image.
For now it's tailored to my needs, and all the comments are in French (see the attached file). I'll publish a proper version in late 2014 or in 2015.
If you want to find good binarization algorithms, you should search on Google for "DIBCO" (Document Image Binarization Contest), "H-DIBCO" (Handwritten Document Image Binarization Contest), or ICDAR (International Conference on Document Analysis and Recognition):
- DIBCO 2011
- DIBCO 2013
- H-DIBCO 2012, and the results
- H-DIBCO 2014, and the results.
- ICDAR 2011
- ICDAR 2013
One of the best algorithms is Document Binarization with Automatic Parameter Tuning by Nicholas R. Howe, which comes with Matlab code, but it's very time-consuming (several minutes per image), and it's hard (for me) to convert the Matlab code to Python/OpenCV or C++/OpenCV.
FAIR (Fast Algorithm for document Image Restoration) is also a very good algorithm, and you can test it online, but the code is closed and it would be difficult to reimplement.
Thanks for linking to all of these extra resources.
xorpt, I especially like the idea of using potrace as a post-processing step for smoothing the letters. After running it and rebinarizing, I noticed that there were far fewer one-pixel jagged ends. The downside seems to be that many letters have their serifs or crossbars shortened a bit by the smoothing, and I also noticed that the letters generally got thicker.
jlb, I knew about the original DIBCO competition, but I hadn't realized that they were running one every couple of years. I will be looking more at the later competitions to see what they come up with. I tried FAIR, but my scanned page was too large for the online version and so I gave up on it. Hopefully you will be able to get a chance at some point to translate your 'Robust' implementation into English so that I can have a shot at understanding it.
I've looked at Howe's automatic tuning algorithm, and it looks very promising to me. It seems to handle variation in text and lighting better than Wolf-Jolion or ST's Otsu variant, and I really like the idea of finding parameters automatically. I want a method that works well for every book and every page; tweaking parameters by hand for each one is a lot of work if you want to scan at scale.
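For contrast with the global method, here is a sketch of Sauvola local thresholding, the classic baseline family that methods like Wolf-Jolion extend (this is not Howe's algorithm). The `window` and `k` values below are exactly the kind of hand-tuned knobs that make automatic parameter selection attractive:

```python
import numpy as np

def sauvola(gray, window=15, k=0.2, R=128.0):
    """Sauvola local thresholding: T = m * (1 + k*(s/R - 1)), where m and
    s are the local mean and standard deviation over a window x window
    neighbourhood. window and k usually need tuning per scan."""
    assert window % 2 == 1
    g = gray.astype(np.float64)
    pad = window // 2
    p = np.pad(g, pad, mode="edge")
    # Integral images give O(1) windowed sums for mean and variance.
    ii = np.zeros((p.shape[0] + 1, p.shape[1] + 1))
    ii[1:, 1:] = p.cumsum(axis=0).cumsum(axis=1)
    ii2 = np.zeros_like(ii)
    ii2[1:, 1:] = (p * p).cumsum(axis=0).cumsum(axis=1)
    w = window
    S = ii[w:, w:] - ii[:-w, w:] - ii[w:, :-w] + ii[:-w, :-w]
    S2 = ii2[w:, w:] - ii2[:-w, w:] - ii2[w:, :-w] + ii2[:-w, :-w]
    mean = S / (w * w)
    std = np.sqrt(np.clip(S2 / (w * w) - mean ** 2, 0, None))
    T = mean * (1 + k * (std / R - 1))
    return np.where(g > T, 255, 0).astype(np.uint8)  # 255 = paper, 0 = ink

# Synthetic page with a strong left-to-right illumination gradient and one
# thin ink stroke in each half; a single global threshold struggles here,
# but the local threshold tracks the changing background.
bg = np.tile(np.linspace(120, 250, 60), (60, 1))
img = bg.copy()
img[20:40, 10:13] -= 90   # stroke on the dark side
img[20:40, 45:48] -= 90   # stroke on the bright side
img = np.clip(img, 0, 255).astype(np.uint8)
bw = sauvola(img, window=15, k=0.2)
```

Note that Sauvola-style methods can hollow out strokes thicker than the window, which is one reason the window size itself becomes a document-dependent parameter.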
I think most of the time spent in Howe's method comes from testing a large space of parameterizations, binarizing the same document dozens or even hundreds of times depending on the search space. But if I am binarizing an entire book at a time, a lot of things don't change: it is all printed on the same paper, with the same ink and printing process, captured under the same lighting conditions, and so on. So maybe I only need to run that parameter search once, and can then binarize each of the remaining pages just once. Binarizing a whole book might therefore be a lot faster than it seems with Howe's method. I'm going to try to translate this into Python/OpenCV. I'll let you know how it goes.
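The tune-once-per-book idea could look something like this sketch. Here `binarize` and `quality_score` are hypothetical stand-ins for whatever binarization routine and candidate-ranking measure you actually use (Howe's code does this ranking internally); the toy demo at the end just records which parameters were chosen:

```python
from itertools import product

def tune_once(sample_page, binarize, quality_score,
              windows=(15, 25, 41), ks=(0.1, 0.2, 0.3)):
    """Run the expensive parameter search on one representative page only."""
    return max(product(windows, ks),
               key=lambda p: quality_score(binarize(sample_page, *p)))

def binarize_book(pages, binarize, quality_score):
    """Search once, then binarize every page exactly once with the winner."""
    window, k = tune_once(pages[0], binarize, quality_score)
    return [binarize(page, window, k) for page in pages]

# Toy demonstration with stand-in functions: "binarizing" just records the
# parameters used, and the score happens to prefer window=25, k=0.2.
fake_binarize = lambda page, w, k: (page, w, k)
fake_score = lambda out: -abs(out[1] - 25) - abs(out[2] - 0.2)
book = binarize_book([1, 2, 3], fake_binarize, fake_score)
```

This assumes the first page is representative of the whole book; picking a handful of sample pages and averaging the scores would be a safer variant.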
For images that will later be turned into PDF or DjVu, what really matters is readability in that output format. Acrobat ClearScan vectorizes the binarized input; that is also the goal of smoothscan.
So we can ask: is it always the case that what looks smooth and good to the human eye in a binarized image will also look good (best; most readable) in a vectorized PDF? I don't know, but I thought the question was worth raising.
dtic, I think this is a good point. As I've learned more about binarization, it has become increasingly clear that there is no 'perfect' solution and that we have to keep our goal in mind. For us, there seem to be three different scenarios:
(a) Pure OCR where the original image is not preserved and only generated text is maintained.
(b) The result of the binarization will be read directly.
(c) The result of the binarization will be converted to a vectorized format (via Acrobat ClearScan or potrace).
A separate smoothing step might make sense for (b) while for (a) and (c) it would be neutral or possibly even harmful.
As an aside, I've been seeing some good results with the Howe method. It seems to do a good job of preserving thin cross lines and small gaps. It also seems to handle faded ink quite well.
jlb, I have been somewhat obsessed with binarization for the last couple of weeks or so. But I have managed to implement Howe's algorithm in Python and verify that the results I am getting are very close to the published ones.
Howe's algorithm came in second place in the most recent DIBCO competition. But I noticed something interesting about the DIBCO data: the ground truth used as the goal in the competitions seems to include every pixel with any ink at all. When you do that, the letters in a document actually appear bolder and blockier, and you lose some of the fine details. In this respect, Howe's algorithm is a bit too good.
I have a variant of Howe's algorithm whose output I like a bit better, but which scores worse against the DIBCO data. It seems to capture thin features a bit better, and the text does not look bolder than the original. On the other hand, it doesn't seem to work on as wide a range of documents. For now, I am planning on using this variant even though it performs a bit worse on the objective measure of the DIBCO dataset.
I need to clean up my code a bit, but I plan on publishing links here to both Howe's algorithm in Python and my variant. I will also post a couple of test scans and show them binarized with ST, with Howe's method, and with my variant.
Finally, here is a small variant of Howe's algorithm that I like. It biases the result (reducing DIBCO accuracy) to make the characters a bit thinner and more readable, and as a side effect it also preserves a few text regions that Howe's method excludes (look at the 'e' in 'struggle'):