Binarization

DIY Book Scanner Skunk Works. Share your crazy ideas and novel approaches. Home of the "3D structure of a book" thread.


duerig
Posts: 388
Joined: 01 Jun 2014, 17:04
Number of books owned: 1000
Country: United States of America

Binarization

Post by duerig »

Binarization consists of taking a grayscale image and converting it to a pure black-and-white image. This is a useful processing step for later OCR conversion to raw text, or as a way of reducing file size.
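For anyone new to the topic, here is a minimal sketch of the simplest approach, a single global Otsu threshold, using Python and OpenCV (the filenames are placeholders; the fancier methods discussed below all try to improve on this):

# Minimal global-threshold sketch, not any particular tool's method.
# 'page.png' is a placeholder filename.
import cv2

gray = cv2.imread('page.png', cv2.IMREAD_GRAYSCALE)
# Otsu picks the threshold that best separates the two intensity classes.
threshold, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite('page-binarized.png', binary)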

Here are some notes and links on various techniques for binarization:
If you find any other techniques that are worth investigating, especially if they have source code available and would be easy to compare with existing methods, please post them here.
xorpt
Posts: 42
Joined: 24 Feb 2012, 01:37
E-book readers owned: Sony PRS-T1
Number of books owned: 2000

Re: Binarization

Post by xorpt »

I made a thread on this forum a while back about binarization in Photoshop, GIMP, and ImageMagick:

http://www.diybookscanner.org/forum/vie ... =19&t=2554

but this is not theoretical at all...
jlb
Posts: 5
Joined: 08 Jun 2014, 15:51
Number of books owned: 0
Country: France

Re: Binarization

Post by jlb »

A very good and recent publication: A Robust Document Image Binarization Technique for Degraded Document Images.
I've implemented it in C++ with OpenCV on Windows 7. It works very well for degraded documents, but it takes about 30 s per image.
For now it is tailored to my needs, and all the comments are in French (see the attached file). I'll publish a proper version in late 2014 or in 2015.
[Attachment: main.cpp (8.37 KiB)]
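In outline, as I understand the paper, the method builds a local contrast image, thresholds it with Otsu to find likely stroke-edge pixels, and then classifies the remaining pixels from local statistics of those edges. Here is a rough Python/OpenCV sketch of just that first stage; the window size and weighting below are illustrative guesses rather than the exact values from the paper or from my C++ code, and the later classification and post-processing stages are not shown:

# Rough sketch of the first stage only: a local contrast map, then Otsu
# on that map to mark high-contrast (probable stroke edge) pixels.
import cv2
import numpy as np

gray = cv2.imread('page.png', cv2.IMREAD_GRAYSCALE).astype(np.float64)

kernel = np.ones((3, 3), np.uint8)
local_max = cv2.dilate(gray, kernel)   # brightest value in each 3x3 window
local_min = cv2.erode(gray, kernel)    # darkest value in each 3x3 window

eps = 1e-8
contrast = (local_max - local_min) / (local_max + local_min + eps)
alpha = 0.5  # illustrative weight; the paper derives it from image statistics
adaptive = alpha * contrast + (1 - alpha) * (local_max - local_min) / 255.0

# Otsu on the contrast map separates edge pixels from the background.
contrast_8u = cv2.normalize(adaptive, None, 0, 255,
                            cv2.NORM_MINMAX).astype(np.uint8)
_, edges = cv2.threshold(contrast_8u, 0, 255,
                         cv2.THRESH_BINARY + cv2.THRESH_OTSU)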
jlb
Posts: 5
Joined: 08 Jun 2014, 15:51
Number of books owned: 0
Country: France

Re: Binarization

Post by jlb »

If you want to find good binarization algorithms, you should search for "DIBCO" (Document Image Binarization Contest), "H-DIBCO" (Handwritten Document Image Binarization Contest), or ICDAR (International Conference on Document Analysis and Recognition) on Google:
- DIBCO 2011
- DIBCO 2013
- H-DIBCO 2012, and the results
- H-DIBCO 2014, and the results
- ICDAR 2011
- ICDAR 2013

One of the best algorithms is Document Binarization with Automatic Parameter Tuning by Nicholas R. Howe, which comes with Matlab code, but it is very time consuming (several minutes per image), and it is hard (for me) to convert the Matlab code to Python/OpenCV or C++/OpenCV.

FAIR (Fast Algorithm for document Image Restoration) is also a very good algorithm, and you can test it online, but the code is closed source and it is difficult to implement.
duerig
Posts: 388
Joined: 01 Jun 2014, 17:04
Number of books owned: 1000
Country: United States of America

Re: Binarization

Post by duerig »

Thanks for linking to all of these extra resources.

xorpt, I especially like the idea of using potrace as a post-processing step for smoothing the letters. After running it and rebinarizing, I noticed that there were far fewer one-pixel jagged ends. The downside seems to be that a lot of the letters have their serifs or crossbars shortened a bit by the smoothing, and I noticed that the letters generally seemed to get thicker.
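For anyone who wants to repeat the experiment, here is roughly the round trip I have in mind, sketched in Python. The filenames are placeholders and I'm relying on potrace's default settings; potrace reads PBM/PGM/PPM/BMP bitmaps, and its "pgm" backend writes an anti-aliased greymap that can then be re-thresholded:

# Binarized image -> potrace (vector smoothing) -> greymap -> re-threshold.
import subprocess
import cv2

binary = cv2.imread('binarized.png', cv2.IMREAD_GRAYSCALE)
cv2.imwrite('binarized.pgm', binary)   # potrace needs a bitmap-family format

subprocess.run(['potrace', '-b', 'pgm', 'binarized.pgm', '-o', 'smoothed.pgm'],
               check=True)

smoothed = cv2.imread('smoothed.pgm', cv2.IMREAD_GRAYSCALE)
_, rebinarized = cv2.threshold(smoothed, 127, 255, cv2.THRESH_BINARY)
cv2.imwrite('rebinarized.png', rebinarized)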

jlb, I knew about the original DIBCO competition, but I hadn't realized that they were running one every couple of years. I will be looking more at the later competitions to see what they come up with. I tried FAIR, but my scanned page was too large for the online version, so I gave up on it. Hopefully you will get a chance at some point to translate the comments in your 'Robust' implementation into English so that I can have a shot at understanding it.

I've looked at Howe's automatic tuning algorithm and it looks very promising to me. It seems to handle text and lighting variation better than Wolf-Jolion or Scan Tailor's Otsu variant, and I really like the idea of finding parameters automatically. I want a method that works well for every book and every page, and tweaking parameters by hand for each one is a lot of work if you want to scan at scale.

I think most of the time spent in Howe's method comes from testing a large space of parameterizations, binarizing the same document dozens or even hundreds of times depending on the search space. But if I am binarizing an entire book at a time, a lot of things don't change: every page is printed on the same paper, with the same ink and printing process, and captured under the same lighting conditions. So maybe I only need to run that parameter search once and then binarize each of the remaining pages a single time. Binarizing a whole book might therefore be a lot faster than it first seems with Howe's method. I'm going to try to translate this into Python/OpenCV, and I'll let you know how it goes.
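In pseudocode, the idea is something like the sketch below; tune_parameters() and howe_binarize() are hypothetical placeholders for the expensive parameter search and the single-pass binarization, not functions from Howe's actual code:

# Tune once per book, then reuse the parameters for every page.
import cv2

def binarize_book(page_paths):
    pages = [cv2.imread(p, cv2.IMREAD_GRAYSCALE) for p in page_paths]

    # Run the expensive parameter search on one representative page only...
    params = tune_parameters(pages[0])

    # ...then reuse those parameters for the rest of the book, since the
    # paper, ink, and lighting should be roughly constant throughout.
    return [howe_binarize(page, params) for page in pages]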
dtic
Posts: 464
Joined: 06 Mar 2010, 18:03

Re: Binarization

Post by dtic »

For images that will later be turned into PDF or DjVu, what really matters is readability in that output format. Acrobat ClearScan vectorizes the binarized input, and that is also the goal of smoothscan.

So we can ask: is it always the case that what looks smooth and good to the human eye in a binarized image will also look good (best; most readable) in a vectorized PDF? I don't know, but I thought the question was worth raising.
duerig
Posts: 388
Joined: 01 Jun 2014, 17:04
Number of books owned: 1000
Country: United States of America

Re: Binarization

Post by duerig »

dtic, I think this is a good point. As I've learned more about binarization, it has become increasingly clear that there is no 'perfect' solution and that we have to keep our goal in mind. For us, there seem to be three different scenarios:

(a) Pure OCR, where the original image is not preserved and only the generated text is kept.
(b) The result of the binarization will be read directly.
(c) The result of the binarization will be converted to a vectorized format (via Acrobat ClearScan or potrace).

A separate smoothing step might make sense for (b), while for (a) and (c) it would be neutral or possibly even harmful.

As an aside, I've been seeing some good results with the Howe method. It seems to do a good job of preserving thin cross lines and small gaps. It also seems to handle faded ink quite well.
duerig
Posts: 388
Joined: 01 Jun 2014, 17:04
Number of books owned: 1000
Country: United States of America

Re: Binarization

Post by duerig »

jlb, I have been somewhat obsessed with binarization for the last couple of weeks, but I have managed to implement Howe's algorithm in Python and verify that the results I am getting are very close to the published ones.

Howe's algorithm came in second place in the most recent DIBCO competition. But I noticed something interesting about the DIBCO data. The ground truth used as the goal in the DIBCO competitions seems to include every pixel with any ink at all, and when you do that, the letters in a document appear bolder and blockier and you lose some of the fine details. In this respect, Howe's algorithm is a bit too good.

I have a variant of Howe's algorithm whose output I like a bit better, but which scores worse against the DIBCO data. It does a better job of capturing thin features, and the text does not look bolder than the original, though it doesn't work on as wide a range of documents. For now, I am planning to use this variant even though it performs a bit worse on the objective measure of the DIBCO dataset.

I need to clean up my code a bit, but I plan on publishing links here to both Howe's algorithm in Python and my variant. I will also post a couple of test scans and show them binarized with ST, with Howe's method, and with my variant.
jlb
Posts: 5
Joined: 08 Jun 2014, 15:51
Number of books owned: 0
Country: France

Re: Binarization

Post by jlb »

duerig wrote: I have managed to implement Howe's algorithm in Python and verify that the results I am getting are very close to the published ones
It will be very interesting to compare the results of the different algorithms!
duerig
Posts: 388
Joined: 01 Jun 2014, 17:04
Number of books owned: 1000
Country: United States of America

Re: Binarization

Post by duerig »

Here is a comparison of the current binarization algorithms I've been exploring.

First, there is the original document:
[Attachment: original.png]
ScanTailor is based on Otsu thresholding with some extra pre-processing and a bit of smoothing afterward:
[Attachment: st.png]
The Wolf-Jolion method using dynamic thresholding:
[Attachment: wolf.png]
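For reference, the Wolf-Jolion threshold, as I understand the paper, is T = m - k*(1 - s/R)*(m - M), where m and s are the local mean and standard deviation in a window, M is the darkest pixel in the image, R is the maximum local standard deviation, and k is usually around 0.5. A rough Python/OpenCV sketch (the window size and k below are my own illustrative choices, not necessarily what was used for the image above):

# Wolf-Jolion style local thresholding sketch.
import cv2
import numpy as np

gray = cv2.imread('page.png', cv2.IMREAD_GRAYSCALE).astype(np.float64)
window, k = (41, 41), 0.5

mean = cv2.boxFilter(gray, -1, window)               # local mean
mean_sq = cv2.boxFilter(gray * gray, -1, window)
std = np.sqrt(np.maximum(mean_sq - mean * mean, 0))  # local standard deviation

M = gray.min()                 # darkest pixel in the whole image
R = max(std.max(), 1e-8)       # maximum local standard deviation
threshold = mean - k * (1 - std / R) * (mean - M)

binary = np.where(gray > threshold, 255, 0).astype(np.uint8)
cv2.imwrite('page-wolf.png', binary)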
I implemented Howe's binarization in Python at https://github.com/duerig/laser-dewarp/ ... s/binarize, which is based on mincut/maxflow:
[Attachment: howe.png]
Finally, here is a small variant of Howe's algorithm that I like. It biases the result (reducing accuracy) to make the characters a bit thinner and more readable, and as a side effect it also preserves a few text regions that Howe's method excludes (look at the 'e' in 'struggle'):
[Attachment: thin.png]