Page Layout problems...

Scan Tailor specific announcements, releases, workflows, tips, etc. NO FEATURE REQUESTS IN THIS FORUM, please.
TJ-Shredder
Posts: 15
Joined: 04 Mar 2014, 00:53

Post by TJ-Shredder » 24 May 2010, 09:35

Hi to all scan tailors,

I am new to book scanning, but I have already gotten some books in quite well. I am using a 12-megapixel Samsung camera, and basically Scan Tailor is doing a good job. It's running on a MacBook with OS X 10.6.3.
But with one particular book I can't get the Page Layout stage to produce anything usable.
First, I don't want to change the layout! It should remain the same as it is in the book itself. Basically I don't want/need to set any margins, as each page is different. (In most books I want to scan, the margins differ from page to page...)
I want one side to be cut off at the line defined by the split pages stage (always, no exception...). The other sides I wanted to cut off at the page layout stage. But it's almost impossible. Scan Tailor tries things I don't want it to do. It should be easy to just look for the borders (my background is black or significantly darker than the page) and cut them off...
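The border search described above can be sketched in a few lines. This is only an illustration of the idea, not how Scan Tailor works: the function names `find_page_box` and `crop`, the threshold of 128, and the `min_fraction` rule are all assumptions, and the image is a plain list of rows of 0-255 grey values so that the sketch needs no imaging library.

```python
def find_page_box(gray, threshold=128, min_fraction=0.5):
    """Return (left, top, right, bottom) of the bright page area, or None.

    gray is a list of rows of 0-255 intensities shot against a dark
    background. A row (or column) counts as part of the page when it holds
    at least min_fraction as many bright pixels as the brightest row
    (or column), so dark ink inside the page does not split the box.
    """
    if not gray or not gray[0]:
        return None
    width = len(gray[0])
    row_hits = [sum(p > threshold for p in row) for row in gray]
    col_hits = [sum(row[x] > threshold for row in gray) for x in range(width)]
    if max(row_hits) == 0:
        return None  # nothing brighter than the background was found
    rows = [y for y, n in enumerate(row_hits) if n >= min_fraction * max(row_hits)]
    cols = [x for x, n in enumerate(col_hits) if n >= min_fraction * max(col_hits)]
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def crop(gray, box):
    """Cut the image down to box; right and bottom are exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in gray[top:bottom]]


# Tiny synthetic shot: 8x6 dark background with a 5x4 bright page on it.
image = [[10] * 8 for _ in range(6)]
for y in range(1, 5):
    for x in range(2, 7):
        image[y][x] = 200
image[2][3] = 20  # one dark "ink" pixel inside the page

box = find_page_box(image)
print(box)  # (2, 1, 7, 5)
```

A real photograph would need a threshold chosen per shot, and the sketch only works when every page edge is actually inside the frame.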
It's driving me mad that the link for the margins is always set; I didn't find a way to have it unchecked by default. (I'd prefer it not be there at all, there is no need for linking margins...)
There is also a bad bug, at least in the OS X version I compiled: if I hit one of the little arrows to fine-tune the margins, the margin doesn't stop changing; it keeps adding more and more until I finally have to force quit Scan Tailor (all changes lost again...)
I'd also prefer that the first page simply define the size for all pages. This would simplify the whole analysis process as well: it would only have to look for a single border, cut there, and cut off the other sides according to the defined size of the first page. I guess that the size of the pages in a book remains the same throughout. This simple information should be considered for any analysis...
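To make the "first page defines the size" idea concrete, here is a minimal sketch under stated assumptions: the split line and the top edge are already known for each page, and the page size was measured once on the first page. The function name `crop_box_from_first_page` and the `spine_side` parameter are hypothetical and do not correspond to anything in Scan Tailor.

```python
def crop_box_from_first_page(split_x, top_y, page_size, spine_side):
    """Build a crop box from one known edge plus a fixed page size.

    split_x    -- x coordinate of the split line (the spine edge of the page)
    top_y      -- y coordinate of the detected top edge
    page_size  -- (width, height) in pixels, measured on the first page
    spine_side -- "left" if the spine is the page's left edge, else "right"
    Returns (left, top, right, bottom) with right and bottom exclusive.
    """
    width, height = page_size
    if spine_side == "left":
        left = split_x
    elif spine_side == "right":
        left = split_x - width
    else:
        raise ValueError("spine_side must be 'left' or 'right'")
    return (left, top_y, left + width, top_y + height)


first_page_size = (1800, 2600)  # hypothetical measurement in pixels
print(crop_box_from_first_page(2000, 150, first_page_size, "left"))
print(crop_box_from_first_page(2000, 150, first_page_size, "right"))
```

The sketch ignores skew and any change in camera distance between shots, both of which would make a fixed pixel size drift from the real page.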
But maybe I just don't get how it has to be done, I could also imagine that stage 4 "select content" does the analysis for finding the borders of the page. It seems to me that this could be done by almost the same algorithm. I think in general it should be way easier than deskewing or finding the content...
It would be nice if Scan Tailor would find out by itself whether a page is blank... This is important for the deskew stage: it fails completely with an empty page, even though the border has a high contrast. Maybe we should somehow tell Scan Tailor the rough size and the maximum deskew angle that could be expected. It's certainly way smaller than the angle by which it turns my blank pages, and it could help in all analysis stages to avoid unlikely interpretations...
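Both suggestions, a blank-page test and an upper bound on the deskew angle, are easy to sketch. The names `ink_fraction`, `is_blank` and `plausible_skew`, the 0.2% ink limit and the 5 degree bound are assumptions picked for illustration; the input is assumed to be already cropped to the page so that the dark background is not counted as ink.

```python
def ink_fraction(gray, threshold=128):
    """Fraction of pixels darker than threshold in a page-only image."""
    total = sum(len(row) for row in gray)
    if total == 0:
        return 0.0
    dark = sum(p < threshold for row in gray for p in row)
    return dark / total


def is_blank(gray, threshold=128, max_ink=0.002):
    """Treat the page as blank when almost no pixel is dark."""
    return ink_fraction(gray, threshold) <= max_ink


def plausible_skew(angle_deg, max_abs_deg=5.0):
    """Keep a detected skew angle only if it is within the expected range.

    An angle beyond the bound is taken as a failed detection and replaced
    by 0.0, meaning "do not rotate".
    """
    return angle_deg if abs(angle_deg) <= max_abs_deg else 0.0


blank_page = [[255] * 100 for _ in range(100)]
text_page = [row[:] for row in blank_page]
for y in range(10, 15):
    for x in range(100):
        text_page[y][x] = 0  # five solid dark lines standing in for text

print(is_blank(blank_page), is_blank(text_page))
print(plausible_skew(1.5), plausible_skew(-37.0))
```

A blank page could then skip deskewing altogether, and an implausible angle on any other page would fall back to no rotation instead of turning the page by a large amount.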
I could also imagine helping Scan Tailor by telling it where to expect page numbers. They usually follow a consistent rule throughout a book. It could protect those areas from elimination in stage 4 or 6...

I just wonder how others do it, maybe I have to install "unpaper" and try that. I did install it, but I have no idea how to use it...

All the best for this fun project...

Stefan

TJ-Shredder
Posts: 15
Joined: 04 Mar 2014, 00:53

Re: Page Layout problems...

Post by TJ-Shredder » 24 May 2010, 13:15

Just to add to my own post, I think it only needs one addition in the select content stage. Have an option there to use the borders of the book as content definition, instead of the algorithm that searches for text blocks. Then I could simply apply zero margins in the layout stage and all is good...
Another solution could be to set a minimum size for content and center it on the found content, but it would be much less useful than just finding the borders of the page...
I guess the existing select content stage is optimized for scanning magazines with a flatbed scanner. I tailored some scans from a German electronics magazine, and for that purpose the margin idea works perfectly, as with this layout the margins in the original are all the same. But with the other book, the select content stage will literally destroy all efforts. If the last stage fails, all work up to this point is lost as well - I have to wait to tailor these photos till there is another method...
If there are 10 pages out of 400 which are not perfect, I can adjust settings, no problem, but if there are 390 wrong out of 400 there is no automation anymore. If I could simply skip the layout stage, I would still be faster doing the cropping by hand in a normal graphics program, but I could not find a way to do that...

Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: Page Layout problems...

Post by Tulon » 24 May 2010, 14:14

TJ-Shredder,

At this point new feature requests are not being considered - I already got enough for years to come.

Having said that, why do you insist on having margins exactly like in the original? Scan Tailor was designed based on the assumption that in 99% of cases that wouldn't matter.
Scan Tailor experimental doesn't output 96 DPI images. It's just what your software shows when DPI information is missing. Usually what you get is input DPI times the resolution enhancement factor.

TJ-Shredder
Posts: 15
Joined: 04 Mar 2014, 00:53

Re: Page Layout problems...

Post by TJ-Shredder » 25 May 2010, 03:19

It's mostly that one book of several hundred pages which just doesn't work at all. Each chapter has a single-page header, centered, and a page number at the bottom. That is quite common for many books. Then there are pages with different widths of text. No matter what margins I choose, on the majority of pages it will move the image in a way that it shows only a small part of the original page and a lot of space outside the original page. That of course destroys the original layout completely as well. Somebody put effort into the layout, I want to keep it...
And keeping the original layout would not matter in maybe 99.5 % of all cases...;-)
If I know the borders of the page, it should be easy as well. I bet the algorithm would be close to the excellent split pages stage. Just turn the page 90° and you get it: look for the top border and ignore the bottom (the page size doesn't change throughout a book in 99.99 % of all cases...)

Is there a way to skip one of the stages? Then I could at least do the rest by hand...

But most important, I want to thank you a lot for your great work. I scanned successfully several not so problematic books and also some magazines. I am pleased that I could install it easily on my Mac. I guess though, I have to learn a bit about git to get the most recent version compiled here...

spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Page Layout problems...

Post by spamsickle » 25 May 2010, 11:55

Tulon, I know you're not entertaining new enhancement requests at this point, and since you are (as far as I can tell) still the sole developer actually writing code for Scan Tailor, I completely understand this decision. Is there a list somewhere which contains the enhancements which are already being planned? I got "Git" last week, and find that browsing the change log is giving me a better understanding of the application and its structure. I still hope to be able to contribute to the development effort. If there is a list of planned enhancements, folks like me might be able to pick off an item here and there and free up some of your time. As an aside, I seem to have run into a problem with the QT definition under Visual Studio 2010 and QT 4.6.2 (CMake is balking at recognizing C:/Qt/4.6.2 as QT_DIR, even though that directory contains projects.pro), so I'll either need to back off to 2009 or dig into CMake/CMakeLists to see what's going on.

I find myself sympathizing with TJ Shredder on the layout question. For instance, with many books, the title page in a new chapter has most of the top half blank. Pages at the end of a chapter often have most of the bottom of the page blank. Scan Tailor can correctly identify the content in each case, but then tries to shoehorn that content into a one-size-fits-all layout. If I specify "top" layout, the title pages lose their distinctive character. "Bottom" layout scrunches the end of the chapter. I realize that these layouts may be less useful in an ebook which is text-searchable than in a hardcopy book which is searched by flipping through it, but the aesthetic decisions made in the original layout still add to the reading experience for some of us old dinosaurs who grew up when hardcopy was the only copy available.

I've taken to preserving the layout by adjusting the content selection to have a lot of blank space at the top or the bottom and choosing "center" layout, so that title pages and end-of-chapter pages still stand out when you're scrolling through the text, but that's an additional manual step that wouldn't be necessary if ST just kept the original page spacing. Keeping the original page layout might also eliminate many of those "page number cutoff" and similar content-detection problems. For text-only books, the processing would probably be faster as well, as the many filters and transformations which are required by the content selection step could be eliminated. Obviously, pages which combine text and images will probably still require the full content selection suite, but I get the feeling that many people here are scanning mostly novels or other pure-text books, in which a quick "4 page edges" content selection would suffice.

Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: Page Layout problems...

Post by Tulon » 25 May 2010, 14:40

spamsickle wrote:CMake is balking at recognizing C:/Qt/4.6.2 as QT_DIR, even though that directory contains projects.pro
Have you tried to put that path in manually? If so, I have absolutely no idea what's going on.
spamsickle wrote:Is there a list somewhere which contains the enhancements which are already being planned?
There is an outdated one on the website, but that's it. Two areas I plan to work on are dewarping and a DjVu generation app. Together, they can easily take a year of my time. I don't keep track of feature requests; in fact, I trained myself to quickly forget them. Otherwise the backlog gets on my nerves.

I also hate to spend my time explaining why this or that feature is either impossible or a bad idea. This one is no exception.
* There are plenty of cases where it can't possibly work (there is no guarantee any edges will be inside the scan/shot).
* Introducing it will create another weak point in Scan Tailor. Right now I receive complaints about page numbers being cut off fairly regularly. Implementing this feature would make me also receive complaints about edges being detected incorrectly.
* Most of the time you don't want margins like in the original anyway. Books have large margins to put your fingers on. Having such margins in ebooks would be a horrible waste of space.

Now, if we limit ourselves to trying to guess the vertical alignment for each page, that would not be such a bad idea, but sorry, probably not in my lifetime.
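As a rough illustration of what guessing the vertical alignment per page could look like (a sketch only, not anything Scan Tailor implements): compare the blank space above and below the detected content. The function name `guess_vertical_alignment` and the 10% tolerance are assumptions.

```python
def guess_vertical_alignment(page_height, content_top, content_bottom, tolerance=0.1):
    """Guess "top", "center" or "bottom" from the blank space around content.

    page_height    -- height of the page in pixels
    content_top    -- y of the first content row
    content_bottom -- y just past the last content row
    tolerance      -- margin difference, as a fraction of page_height,
                      below which the page is considered centered
    """
    top_margin = content_top
    bottom_margin = page_height - content_bottom
    if abs(top_margin - bottom_margin) <= tolerance * page_height:
        return "center"
    return "top" if top_margin < bottom_margin else "bottom"


print(guess_vertical_alignment(1000, 100, 900))  # full text page
print(guess_vertical_alignment(1000, 500, 920))  # chapter title page, top half blank
print(guess_vertical_alignment(1000, 90, 400))   # last page of a chapter
```

With these sample values the three calls give "center", "bottom" and "top". The guess is only as good as the content detection behind it, so a cut-off page number would shift the result.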

StevePoling
Posts: 290
Joined: 20 Jun 2009, 12:19
E-book readers owned: SONY PRS-505, Kindle DX
Number of books owned: 9999
Location: Grand Rapids, MI

Re: Page Layout problems...

Post by StevePoling » 26 May 2010, 02:12

Tulon wrote:At this point new feature requests are not being considered - I already got enough for years to come.
Tulon, would you agree to consider ideas submitted on $100 bills?

A radio show, Car Talk, has a running gag where the announcer encourages listeners to write letters attached to luxury items, then mail them in. You could adapt that idea for Scan Tailor feature requests.

Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: Page Layout problems...

Post by Tulon » 26 May 2010, 02:44

StevePoling wrote:Tulon, would you agree to consider ideas submitted on $100 bills?
Surprisingly, no. An extra $100 will change absolutely nothing in my life.

spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Page Layout problems...

Post by spamsickle » 26 May 2010, 09:41

Yes, I'm adding the QT_DIR manually. I can't get it to stop worrying that "QT_DIR is [...] Make sure that's correct", but the "Generate" button still becomes clickable anyway, so I guess it isn't a showstopper.

You should never feel obliged to do something you hate on my account. You're the guy who wrote Scan Tailor, so if it's doing what you want it to do, great. Your main design objective was to make books that can be easily downloaded, and that isn't really a concern for me. I get to use the application for free, so if it does anything that I want it to do, I'm ahead of the game. If it doesn't do everything I want it to do, I can either change it myself or find something else that does.

Because of the huge backlog of books I have, I think I'm going to take a break on post-processing anyway. I can rotate and collate my raw JPEGs in no time at all, and read them just fine in any slide show program (with the original formatting!). I wasn't doing OCR before, and my main objective has always been to cut tons of paper down to pounds of optical storage. It's been fun playing with Scan Tailor, and I think it's a great application, but I have to admit it's kind of a distraction at this point. I can always come back to it later, after I've cleared the shelves.

Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK

Re: Page Layout problems...

Post by Tulon » 26 May 2010, 15:44

spamsickle wrote:Yes, I'm adding the QT_DIR manually. I can't get it to stop worrying that "QT_DIR is [...] Make sure that's correct", but the "Generate" button still becomes clickable anyway, so I guess it isn't a showstopper.
That's because it's not an error - just an attention grabber. I put it there because I have multiple Qt versions installed and the one that gets chosen automatically is usually not the one I want.
spamsickle wrote:You should never feel obliged to do something you hate on my account. You're the guy who wrote Scan Tailor, so if it's doing what you want it to do, great. Your main design objective was to make books that can be easily downloaded, and that isn't really a concern for me. I get to use the application for free, so if it does anything that I want it to do, I'm ahead of the game. If it doesn't do everything I want it to do, I can either change it myself or find something else that does.
I know. Telling people I don't consider feature requests is one way to avoid having to explain why I think it's a bad idea or a non-important issue.