Problem with "Select Content"

Evgeny
Posts: 5
Joined: 04 Mar 2014, 00:53

Problem with "Select Content"

Post by Evgeny » 10 Dec 2010, 15:39

Could you give some advice on using a regular scanner rather than a camera?

I have a lot of pages where Scan Tailor incorrectly determines the text area. Examples: page 12, page 13 and page 16. Sources: http://www.megaupload.com/?d=QXH5XU2B and http://www.megaupload.com/?d=6VHCNIFO.

I am using an Epson Perfection V300 scanner and the Epson Scan utility. In it, I chose Home Mode, Text/Line Art as the document type, and grayscale at 600 dpi. By default, it also had the Color Control radio button selected, with Continuous Auto Exposure checked and Display Gamma at 2.2. All other settings (Descreening, Text Enhancement, etc.) are off. I tried choosing No Color Correction, but the Scan Tailor results were no better.

Edit: links.

reggilbert
Posts: 49
Joined: 28 Sep 2010, 19:57
Number of books owned: 3000
Location: Buffalo, New York

Re: Problem with "Select Content"

Post by reggilbert » 10 Dec 2010, 18:01

Evgeny wrote:Could you give some advice on using a regular scanner rather than a camera? I have a lot of pages where Scan Tailor incorrectly determines the text area.
I am no expert, but I have been scanning books on my Plustek Opticbook 3600 for years and recently had a very good experience splitting the images it produces using Scan Tailor. Two suggestions:

1) Have you taken maximum advantage of the scanner software's ability to output the cleanest images, so that Scan Tailor has less noise to deal with? On my Opticbook, for example, I can set a global cropping pattern for a job, so there are no bottom-and-right or top-and-left dark areas caused by scanning of the exposed scanner cover outside the edges of the book.

2) You might try 300 dpi. There is a strange phenomenon, commonly -- almost universally -- agreed on, that OCR accuracy from even the best OCR engines is better at the lower dpi. I don't know why this would affect page splitting in Scan Tailor, but if you do not really need 600 dpi, you can probably speed up your scan times with the lower resolution, and maybe Scan Tailor will like it too.

Reg

Anonymous1

Re: Problem with "Select Content"

Post by Anonymous1 » 10 Dec 2010, 19:51

I get this sort of glitch when I have linear streaks or same-colored lines of pixels. For me the cause was lighting, so I can't help much if you are using a flatbed scanner. It comes from the ST algorithm, which (more or less) looks for vertical lines of pixels. I would also recommend despeckling. I can provide you with a few scripts (if you run Linux) to accomplish such a task, if needed.
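For readers who want to see what despeckling amounts to, here is a minimal sketch in plain Python. It is not one of the scripts offered above and not Scan Tailor's own despeckler; the function name `despeckle`, the `max_speck_size` threshold, and the 0/1 list-of-lists image format are all assumptions made for illustration. It removes any 8-connected blob of ink pixels that is no larger than the threshold.

```python
from collections import deque


def despeckle(image, max_speck_size=4):
    """Return a copy of a binary image (1 = ink, 0 = paper) with small ink blobs removed.

    `image` is a list of equal-length rows. `max_speck_size` is a hypothetical
    threshold: blobs with this many pixels or fewer are treated as specks.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one 8-connected blob starting at (y, x).
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Erase the blob only if it is small enough to count as a speck.
            if len(blob) <= max_speck_size:
                for by, bx in blob:
                    result[by][bx] = 0
    return result
```

Note that a size threshold like this only removes isolated dots. A long streak is a large connected blob and would survive, so streaks still have to be avoided at scan time or removed some other way.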

Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: Problem with "Select Content"

Post by Tulon » 11 Dec 2010, 04:17

The artificial white margins cause content selection to fail. If you can convince your scanner not to output those, you'll be fine.
Scan Tailor experimental doesn't output 96 DPI images. It's just what your software shows when DPI information is missing. Usually what you get is input DPI times the resolution enhancement factor.
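The DPI arithmetic in the note above can be written out as a small sketch. The function names, the example factor, and the idea of a viewer falling back to a fixed default are assumptions for illustration; this is not Scan Tailor code.

```python
def output_dpi(input_dpi, enhancement_factor):
    """Effective output DPI: input DPI times the resolution enhancement factor."""
    return input_dpi * enhancement_factor


def displayed_dpi(stored_dpi, viewer_default=96):
    """What a viewer might show: the stored DPI, or its own default if none is stored.

    `viewer_default=96` is an assumed fallback, matching the value mentioned above.
    """
    return stored_dpi if stored_dpi else viewer_default


# Hypothetical example: a 300 DPI input with an enhancement factor of 2.
effective = output_dpi(300, 2)
```

So an image that a viewer reports as 96 DPI may simply have no DPI stored in the file, while its pixel dimensions still correspond to the higher effective resolution.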

Evgeny
Posts: 5
Joined: 04 Mar 2014, 00:53

Re: Problem with "Select Content"

Post by Evgeny » 12 Dec 2010, 17:51

Thank you all for your replies.

It seems to me that Scan Tailor has a lot of room for improvement in how it detects the content area. I tried different settings when scanning six particular pages, but every time the content was incorrectly determined on at least one page. To a human eye, the top text boundary is absolutely clear; it does not even come close to CAPTCHA puzzles. The performance is especially surprising since ST seems to do a pretty good job in deskewing, where it has to detect the left text boundary.
Tulon wrote:The artificial white margins cause content selection to fail.
In my opinion, this is the same as saying, "The 'Select Content' stage has to be done manually because Scan Tailor's algorithm is not good at this point."
reggilbert wrote:On my Opticbook, for example, I can set a global cropping pattern for a job, so there are no bottom-and-right or top-and-left dark areas caused by scanning of the exposed scanner cover outside the edges of the book.
I can set a cropping area that will apply to all pages, but I can't set it precisely to the millimeter. If I did, part of any slightly tilted page would fall outside the area. Are you talking about an area that is fixed or one that is adjusted to each page?
Anonymous1 wrote:I can provide you with a few scripts (if you run Linux) to accomplish such a task, if needed.
Thank you, I would appreciate this.

Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: Problem with "Select Content"

Post by Tulon » 12 Dec 2010, 18:39

Evgeny wrote:In my opinion, this is the same as saying, "The 'Select Content' stage has to be done manually because Scan Tailor's algorithm is not good at this point."
Select Content works well for common cases. Yours is not one of them. I am not sure how you ended up with those pure white margins around your page. It's as if you tried to crop it, but specified a crop area larger than the page itself. Interestingly, Scan Tailor wouldn't have a problem with dark or even black margins, because it expects and specifically handles them. You can't easily handle artificial white margins, because you don't know if it's a page plus artificial margins or a big grey picture on pure white paper.
Evgeny wrote:To a human eye, the top text boundary is absolutely clear
What's easy for a human may be quite hard for a computer. For example, deskewing is actually a much simpler operation than Select Content. I would say an order of magnitude simpler.
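As a rough illustration of the white-margin case, here is a minimal Python sketch that crops away border rows and columns made up entirely of pure-white pixels before the image goes to Scan Tailor. It is not part of Scan Tailor. The function name `trim_white_margins`, the 8-bit grayscale list-of-lists format, and the assumption that the artificial fill is exactly 255 while real paper is slightly off-white are all illustrative.

```python
def trim_white_margins(image, white=255):
    """Crop border rows and columns that consist only of pure-white pixels.

    `image` is a list of equal-length rows of 8-bit grayscale values.
    Assumes (hypothetically) that artificial margins are exactly `white`
    and that real scanned paper is at least slightly darker.
    """
    # Rows that contain at least one pixel that is not pure white.
    rows = [y for y, row in enumerate(image) if any(p != white for p in row)]
    if not rows:
        return []  # the whole image is pure white, nothing to keep

    # Columns that contain at least one pixel that is not pure white.
    cols = [x for x in range(len(image[0]))
            if any(row[x] != white for row in image)]

    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]
```

This runs into exactly the ambiguity described above: it cannot tell a page surrounded by artificial margins from a grey picture printed on pure white paper, so it is only safe when you already know the margins were added by the scanner software.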

Evgeny
Posts: 5
Joined: 04 Mar 2014, 00:53

Re: Problem with "Select Content"

Post by Evgeny » 13 Dec 2010, 15:38

Ah, I see now which white margins you meant. They are there because the scanner's glass was bigger than the pages. I did not try to crop anything. Yes, I can make the scanner software select only the scanned paper, because that does not require much precision, and the ST results were much better when I did. Thanks for the tip!