Page numbers being cut off in content selection

Misty
Posts: 481
Joined: 06 Nov 2009, 12:20
Number of books owned: 0
Location: Frozen Wasteland

Page numbers being cut off in content selection

Post by Misty » 30 Apr 2010, 10:51

I've noticed that Scan Tailor sometimes misses page numbers during the content selection phase, usually if there is a bit of whitespace separating them from the main page content. I'm not sure if this is a bug or if it's by design, but I thought I'd mention it so it might be improved. This image shows an example of a page where the page number was missed. The example is from 0.9.8 - I haven't had a chance to upgrade to 0.9.8.1 yet.
page number cut off.png (193.18 KiB) Viewed 6349 times
Edit: Here's another page with another problem in the same manuscript. Here, it identified the top of the first paragraph as the top of the page, cutting off a date and the page number as well as half of a photograph. I have other examples of photos being cut off as well.
page number cut off 2.png (183.2 KiB) Viewed 6348 times
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.

spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Page numbers being cut off in content selection

Post by spamsickle » 30 Apr 2010, 11:41

"Dropped page numbers" is a bug, and Tulon is aware of it, but it isn't clear how it can be corrected in the current content selection algorithms.

Misty

Re: Page numbers being cut off in content selection

Post by Misty » 30 Apr 2010, 11:52

Okay, thanks. Are the cropped off photos part of the same bug?

spamsickle

Re: Page numbers being cut off in content selection

Post by spamsickle » 30 Apr 2010, 13:52

I would think they're at least somewhat independent, meaning it might be possible to fix one without fixing the other, but I'm just guessing. I haven't done any digging into the nitty gritty of the content selection algorithms, and don't know what is causing the dropped page numbers -- perhaps something similar to "despeckling", perhaps not. I do know that Tulon has acknowledged that it's a problem, somewhere* in that original Scan Tailor thread, but indicated at the time that it probably wouldn't be easy to fix.

* Edit: January 10 post: "The content box selection algorithm often cuts off page numbers and sometimes other kinds of headers / footers. These cases are actually quite complex. If a page number is far away from the rest of page content, how would you tell it apart from a book's edge? Still I haven't given up on this one. Recently I came across a couple (one, two) papers on this subject. Didn't have time to look into them yet, but they give me hope."

Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: Page numbers being cut off in content selection

Post by Tulon » 01 May 2010, 11:04

I wouldn't call it a bug; at least it's no more of a bug than your OCR program making a mistake. The farther something is from dense text areas, the more likely it is to get classified as garbage. The only thing that can save those areas is being classified as text. The success rate of the current "text/not text" classifier is maybe somewhere around 85% for good quality scans (read: 300+ DPI) and much lower for bad quality ones.
As for pictures, ST can't tell a large dark area in a picture from a shadow of a book's edge.
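To make the trade-off above concrete, here is a toy sketch in Python. It is not Scan Tailor's actual algorithm: the `Region` type, the `max_gap` cutoff and the `looks_like_text` flag (standing in for the verdict of a text/not-text classifier) are all assumptions made for illustration. It only shows why an isolated page number survives or disappears depending on that one classification.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A detected blob on the page (hypothetical type, for illustration only)."""
    name: str
    top: int
    bottom: int
    looks_like_text: bool  # stand-in for a text/not-text classifier verdict


def vertical_gap(region, block_top, block_bottom):
    """Vertical distance between a region and the dense text block (0 if they overlap)."""
    if region.bottom < block_top:
        return block_top - region.bottom
    if region.top > block_bottom:
        return region.top - block_bottom
    return 0


def keep_region(region, block_top, block_bottom, max_gap=40):
    """Keep regions close to the text block; distant ones survive only if classified as text."""
    gap = vertical_gap(region, block_top, block_bottom)
    return gap <= max_gap or region.looks_like_text


def content_box(regions, block_top, block_bottom, max_gap=40):
    """Grow the (top, bottom) extent of the text block to cover every kept region."""
    kept = [r for r in regions if keep_region(r, block_top, block_bottom, max_gap)]
    top = min([block_top] + [r.top for r in kept])
    bottom = max([block_bottom] + [r.bottom for r in kept])
    return top, bottom
```

With a text block spanning rows 100 to 800 and a page number at rows 950 to 970, the page number ends up inside the box only when `looks_like_text` is true. A classifier that is right about 85% of the time will therefore drop a noticeable share of isolated page numbers.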
Scan Tailor experimental doesn't output 96 DPI images. It's just what your software shows when DPI information is missing. Usually what you get is input DPI times the resolution enhancement factor.

Tulon

Re: Page numbers being cut off in content selection

Post by Tulon » 01 May 2010, 11:07

The next version will have a feature that will simplify locating pages with parts cut off. It will be possible to sort pages on the "Select Content" stage by width or height of the content box. It's already implemented and you can test it in the latest build: http://www.onlinedisk.ru/file/421091/

Note: it won't work on old projects until you do a full batch processing run on the Select Content stage.
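The idea behind sorting by content box size can be sketched outside the GUI as well. The Python below is only an illustration under assumptions: the `ContentBox` tuple, the page names and the median-based `suspicious_pages` helper are hypothetical and are not part of Scan Tailor, which simply offers the sort order. Pages where a page number or photo was cut off tend to have a shorter content box, so they cluster at one end of the sorted list.

```python
from statistics import median
from typing import NamedTuple


class ContentBox(NamedTuple):
    """Content box size for one page (hypothetical record, for illustration only)."""
    page: str
    width: int
    height: int


def sort_by_height(boxes):
    """Shortest content boxes first, so truncated pages surface at the top."""
    return sorted(boxes, key=lambda b: b.height)


def suspicious_pages(boxes, tolerance=0.05):
    """Pages whose content box is more than `tolerance` shorter than the median height."""
    cutoff = median(b.height for b in boxes) * (1 - tolerance)
    return [b.page for b in sort_by_height(boxes) if b.height < cutoff]


boxes = [
    ContentBox("p1", 1400, 2000),
    ContentBox("p2", 1400, 1850),  # e.g. page number cut off
    ContentBox("p3", 1400, 2010),
    ContentBox("p4", 1390, 1200),  # e.g. short final page of a chapter
    ContentBox("p5", 1400, 1990),
]
print(suspicious_pages(boxes))
```

A short box is not always an error (the last page of a chapter is legitimately short), so the list is a set of candidates to check by eye rather than a verdict.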

Misty

Re: Page numbers being cut off in content selection

Post by Misty » 03 May 2010, 09:29

That makes sense. ST usually does quite good picture detection in mixed mode. Does it depend on content selection being done beforehand, or is that something that could be used to help determine what is an illustration to keep in content detection?

Speaking of picture detection, I had a question about that. It does a good job most of the time, but I've noticed that light-coloured images tend to have some of their lightest portions skipped over. When that happens, it's usually the case that one of the edges has a light portion (for instance, the sky in a photo) that is interpreted as page background; even if it is only a very small part on the edge, that then "leaks" into the image for the entire interior light part. Most of the photos and illustrations in the books I'm scanning are roughly square or rectangular, so it seems that this aspect of the detection would be predictable. Would it be feasible to add a mode where ST prioritizes either rectangular illustrations, or ones with all closed edges to help prevent this?
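The "leak" described here behaves like a flood fill of the page background. The Python sketch below is a guess at the general mechanism, not Scan Tailor's picture detector: the 7x7 test page, the brightness threshold of 200 and all function names are assumptions for illustration. It shows how a single light pixel on a photo's edge lets the background fill reach the whole light interior, and how filling the bounding rectangle of what remains would recover it for rectangular pictures.

```python
from collections import deque


def flood_background(image, threshold=200):
    """Pixels reachable from the page border by stepping only through light pixels."""
    rows, cols = len(image), len(image[0])
    seen, queue = set(), deque()
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and image[r][c] >= threshold:
                seen.add((r, c))
                queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in seen and image[nr][nc] >= threshold):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen


def picture_mask(image, threshold=200):
    """Everything that the background fill could not reach counts as picture."""
    every_pixel = {(r, c) for r in range(len(image)) for c in range(len(image[0]))}
    return every_pixel - flood_background(image, threshold)


def bounding_rectangle(mask):
    """Fill the axis-aligned bounding box of a mask (the 'prefer rectangles' idea)."""
    rows = [r for r, _ in mask]
    cols = [c for _, c in mask]
    return {(r, c)
            for r in range(min(rows), max(rows) + 1)
            for c in range(min(cols), max(cols) + 1)}


def make_page(leak=False):
    """White 7x7 page holding a 5x5 photo: dark frame (50) around a light 'sky' (230)."""
    image = [[255] * 7 for _ in range(7)]
    for r in range(1, 6):
        for c in range(1, 6):
            image[r][c] = 50 if r in (1, 5) or c in (1, 5) else 230
    if leak:
        image[1][3] = 230  # one light pixel on the photo's top edge
    return image
```

On the closed photo the mask covers all 25 picture pixels. With the single light edge pixel it shrinks to the 15 dark frame pixels, and `bounding_rectangle` restores the full 25. The rectangle fill would of course be wrong for non-rectangular illustrations, which is why it would have to be an optional mode.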

Tulon

Re: Page numbers being cut off in content selection

Post by Tulon » 03 May 2010, 09:49

Misty wrote: That makes sense. ST usually does quite good picture detection in mixed mode. Does it depend on content selection being done beforehand, or is that something that could be used to help determine what is an illustration to keep in content detection?
Picture detection depends on shadows being removed, which in turn depends on knowing the content box.
Misty wrote: Speaking of picture detection, I had a question about that. It does a good job most of the time, but I've noticed that light-coloured images tend to have some of their lightest portions skipped over. When that happens, it's usually the case that one of the edges has a light portion (for instance, the sky in a photo) that is interpreted as page background; even if it is only a very small part on the edge, that then "leaks" into the image for the entire interior light part. Most of the photos and illustrations in the books I'm scanning are roughly square or rectangular, so it seems that this aspect of the detection would be predictable. Would it be feasible to add a mode where ST prioritizes either rectangular illustrations, or ones with all closed edges to help prevent this?
Feature requests are being ignored at this point. I've already got more than I can possibly do in my lifetime.

Misty

Re: Page numbers being cut off in content selection

Post by Misty » 03 May 2010, 10:11

Tulon wrote: Feature requests are being ignored at this point. I've already got more than I can possibly do in my lifetime.
I understand. I know you said you were not going to be taking feature requests at this point in time. I'm just wondering about planned improvements to the feature, if any. I understand if that aspect is not a priority for you.

I appreciate all the hard work you've put into this program, Tulon.

Tim

Re: Page numbers being cut off in content selection

Post by Tim » 12 May 2010, 12:38

Tulon wrote: Feature requests are being ignored at this point. I've already got more than I can possibly do in my lifetime.
What we really need to do is recruit one or more skilled C++ developers for you. I should imagine we could find some who would be excited about DIY book scanning and get them hooked into helping Scan Tailor that way. Daniel, you get to talk about book scanning a bit. Do you ever get a chance to solicit development help?
