Include/exclude areas in the content selection stage

Benedictus · Post by **Benedictus** » 09 Feb 2013, 11:24

The content selection stage is usually the slowest and most tedious stage of the book processing workflow with ST. When ST fails to recognize the content there, it is usually for one of two reasons:

1. It misses a part of the content which is far from the main text body. This is common on pages where the page number is at the bottom of the page and the text takes up only a few lines in the upper part. Typically, this happens at the ends of chapters.

2. It includes in the content stuff that is neither text nor images. It happens to me a lot when I scan books that I underlined while reading them. Even when scanning clean books there is the occasional dot, line or spot that is detected as part of the content if it is near the text.

I have two ways of dealing with this. One of them helps ST a lot, can be done very quickly and can overcome the underlining problem (but not other problems) almost completely. The other is painfully slow and tedious:

1. The fast workaround is miraculous when it comes to selecting the content of dirty (i.e., underlined) pages. It basically consists of feeding ST clean files that are otherwise identical to the files we actually want in the output, letting it work with them, and only switching to the original files at the output stage. This is how I do it:

a. I scan the files (in grayscale, obviously) and load them in ABBYY FineReader (which is the software I use for this). There I deskew them (it does it better than ST in my opinion) and save the files. Then I apply some aggressive levels to them. This way virtually all the dirt is gone, but the text boundaries are still clear enough for ST to select the content correctly.

b. I now save the "leveled" files and go through stages 1-5 in ST with them. The files must have exactly the same names as the files saved in step a. This allows relinking later. After every stage in ST, everything (especially content selection and margins) must be set to manual. This is what allows us to apply the work done on the clean files to the original files.

c. Once the content has been selected, the margins are set and only the output stage remains, I relink (Tools > Relinking...) the project to the original deskewed files saved in step a and simply let ST process the output stage.
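Since the relinking in step c depends on the leveled and the original files having identical names, it can help to check that before opening ST. Here is a minimal Python sketch of such a check. It is not part of ST or FineReader; the function name and the folder layout (one folder per set of files) are my own assumptions for illustration.

```python
from pathlib import Path


def relink_mismatches(clean_dir, original_dir):
    """Compare file names in two folders (hypothetical helper, not an ST tool).

    Returns (only_in_clean, only_in_original) as sorted lists of names.
    Both lists are empty when every file has a same-named counterpart.
    """
    clean = {p.name for p in Path(clean_dir).iterdir() if p.is_file()}
    original = {p.name for p in Path(original_dir).iterdir() if p.is_file()}
    return sorted(clean - original), sorted(original - clean)
```

For example, `relink_mismatches("leveled", "deskewed")` (folder names made up) should return two empty lists if the two sets of files line up by name.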

2. The slow and tedious way is simply editing the images with an image processing program. I use GIMP:

a. For parts that are ignored by ST, I just draw lines that run from the text which is being correctly detected to the part of the page which ST ignores, and overwrite the files. After this, when ST does auto content detection, it includes the part that it originally missed. For example:

On this page only the text in the upper part was being detected. After drawing the lines, the whole content is detected. I do this instead of just dragging the content selection margins because I like the accuracy of ST when it comes to selecting the content and I know I wouldn't be as accurate. Besides, drawing on the files doesn't matter, because these are the disposable files I talked about earlier.

b. For issue 2 (stuff that is neither text nor images being incorrectly detected as content), I just erase the offending bits with GIMP, save the files and let ST autodetect again.

I'd like to have these last two features included in ST in the form of the ability to include or exclude selected areas from the automatic content selection. Inclusion should work more like "try to autodetect also in this area" than like the "include exactly this" that margin dragging does. This is more or less what ST already does with images in mixed mode, so it shouldn't be hard to implement and would speed up the content selection stage a lot.
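To make the "exclude" half of the idea concrete, here is a toy Python sketch. It is not how ST detects content (I haven't looked at its code); it just takes the bounding box of all dark pixels on a tiny made-up page and skips anything inside a user-drawn exclusion rectangle. The function name and the page data are invented for illustration.

```python
def content_box(page, exclude=()):
    """Bounding box of dark pixels, ignoring excluded rectangles.

    page:    list of rows, 1 = dark pixel, 0 = background
    exclude: rectangles as (left, top, right, bottom), inclusive
    Returns (left, top, right, bottom), or None if nothing is left.
    """
    def excluded(x, y):
        return any(l <= x <= r and t <= y <= b for l, t, r, b in exclude)

    xs, ys = [], []
    for y, row in enumerate(page):
        for x, value in enumerate(row):
            if value and not excluded(x, y):
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# Made-up page: a block of text plus one stray spot at x=5, y=3.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
]

print(content_box(page))                          # (1, 1, 5, 3), spot included
print(content_box(page, exclude=[(5, 3, 5, 3)]))  # (1, 1, 3, 2), spot ignored
```

The point is only that the box is still computed automatically around the text; the exclusion zone just tells the detection what to ignore, which is what erasing in GIMP achieves now.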

So what do you think? Would you like this in ST? It would be the icing on the cake IMO.

dtic · Post by **dtic** » 09 Feb 2013, 13:03

When ST fails to select parts of the text, like page numbers at the bottom, you can speed up handling of that within ST in this way:
1. Run the batch selection in ST
2. Run this script http://diybookscanner.org/forum/viewtop ... =21&t=2698
3. Sort selection thumbnails according to height (lowest selection height at the top)
4. Now fix the problematic selections image by image. Keep the mouse pointer on the large selection preview. Click and drag the selection where needed and press A to finish the selection before you release the mouse click. Repeat this for the next image. Use W/Q to move to next/previous image, not the page up/down buttons.
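The reason for sorting by height in step 3 is that pages where ST missed a far-away page number end up with unusually short selections, so they float to the top. A minimal Python sketch of that ordering, using made-up page names and sizes (this is not the linked script and it does not read ST project files):

```python
def shortest_first(selections):
    """Return page names ordered by selection height, shortest first.

    selections: dict mapping page name -> (width, height) of the
    detected content box. The data format is assumed for illustration.
    """
    return sorted(selections, key=lambda name: selections[name][1])


# Hypothetical selection sizes in pixels.
selections = {
    "page_010": (1400, 2100),
    "page_011": (1400, 350),   # chapter ending, page number probably missed
    "page_012": (1400, 2080),
}

print(shortest_first(selections))  # ['page_011', 'page_012', 'page_010']
```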

The main developer of Scan Tailor has asked for no requests to be made in the forum here. On top of that Scan Tailor isn't actively developed for the time being and the latest release was about one year ago http://scantailor.sourceforge.net/?q=node/30 However, ST is open source, so things might change if other people join in and develop it further. Your best chance at the moment is probably that someone comes along who wants the same features that you want and knows how to code them. If any such person reads this: one way to solve the problem I've discussed would be to add a control+click action that makes ST expand the selection to include the control+clicked position.
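In terms of geometry, the control+click idea is just growing the selection rectangle until it contains the clicked point. A minimal Python sketch under that assumption (ST itself is written in C++, and the function name and coordinate convention here are made up):

```python
def expand_to_include(rect, point):
    """Grow rect just enough to contain point.

    rect:  (left, top, right, bottom) in pixels
    point: (x, y) of the control+click
    A point already inside the rectangle leaves it unchanged.
    """
    left, top, right, bottom = rect
    x, y = point
    return (min(left, x), min(top, y), max(right, x), max(bottom, y))


# Selection covers the text at the top; the click is on the page number below.
print(expand_to_include((100, 80, 900, 400), (500, 1150)))  # (100, 80, 900, 1150)
```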

Benedictus · Post by **Benedictus** » 09 Feb 2013, 13:14

dtic wrote: The main developer of Scan Tailor has asked for no requests to be made in the forum here. On top of that Scan Tailor isn't actively developed for the time being and the latest release was about one year ago

The request forum is locked, at least for me. I know that ST isn't being developed, but I think that the Enhanced branch is. Am I right? In that case, my request would be for ST Enhanced.

The workaround is very nice. Thank you very much. However, I still think that including and excluding areas for autodetection would be better.

dtic · Post by **dtic** » 09 Feb 2013, 13:46

Well, new alpha releases of Scan Tailor Enhanced keep showing up on SourceForge. But there are no release notes, and the forum member Pejuko, who has developed the Enhanced version, hasn't posted here in a while. See this thread http://diybookscanner.org/forum/viewtop ... 4&start=70

Benedictus wrote: However, I still think that including and excluding areas for autodetection would be better.

Me too. But I found no way to do that in a script and settled for the second best.

spamsickle · Post by **spamsickle** » 12 Feb 2013, 11:15

For the type 2 stuff (noise incorrectly identified as content) Scan Tailor has the ability to draw exclusion zones (I forget the term Scan Tailor uses, it's been a while since I've used it) which is probably easier than editing them in GIMP and going through the autodetect again. As I recall, these were specified after content detection. I've used the feature to erase punched holes in 3-ring binder scans, for instance.
