Post by BannanoPeel » 08 Jul 2010, 17:27

First off, I would like to say that I love Scan Tailor so far. Previously I used the 15-day trial version of BookDrive Editor, which is not nearly as accurate (not to mention they want $1200 for the full version).

There is just one thing that is really needed: a fixed content select box. I'm planning on scanning a ton of books, but having to adjust the content box for all the pages will be far too time consuming. I'd like something where you can make a box for the first page and the program automatically crops all the other pages exactly the same way.

I know the developer of Scan Tailor is not currently taking suggestions, so I'm trying to figure out a way to alter my scans so that Scan Tailor thinks there is content around the outside of the book when there really isn't; then I'll just batch crop all the files when I'm done. Right now I'm trying to put a 2'' border of paper with text on it around the book. I was hoping that this would cause the content box to be drawn around the border automatically, so I could crop the 2'' out later, but I'm not having much luck. Does anyone have any ideas? It would be really nice if I could just disable Select Content and Margins and then use Acrobat to crop all the images.

Post by Tulon » 08 Jul 2010, 18:17

Why do you think you would have to adjust the content box every time? Is auto-detection failing for you frequently? What kind of material do you have?

If I ever implement this feature, I would need very serious reasons for doing so. Generally, I don't like this kind of feature, because it is worse than useless for people without special hardware. Activating such a feature will never help them, but will usually hurt, especially considering the lack of an undo feature in Scan Tailor.

In addition, I am now working on dewarping, and expect that to take several months. I am not going to distract myself by working on even the features I like.

If you really need this functionality, you would have to do it yourself. Maybe patching the project file would work for you.
Scan Tailor experimental doesn't output 96 DPI images. It's just what your software shows when DPI information is missing. Usually what you get is input DPI times the resolution enhancement factor.

Post by eL_PuSHeR » 09 Jul 2010, 02:46

I think a fixed content box would have some usefulness when processing scanned magazines, where text is often displaced by photos or surrounded by colored borders. Language textbooks are also commonly affected. But if this feature is implemented some day, it *should* really be a switch that can be disabled (and should be disabled by default). I am with Tulon on this: this feature doesn't seem very high priority to me.

Post by spamsickle » 09 Jul 2010, 05:54

Well, yeah, automatic content selection fails frequently if you have page numbers, chapter headings, and sometimes footnotes which are too far from the main content block. This is a known problem.

Since my priority at the moment is to get a large number of books off my shelves and onto digital media, I'm currently just taking things as far as "cleaned-up JPGs" rather than investing the hour or two (and sometimes more) of interactive refinement per book which Scan Tailor requires. I am impressed with the quality and consistency of output from Scan Tailor, and expect at some point most of my books will be processed by it or a program like it, but right now I'm spending more time scanning and less time post-processing.

If you want to implement the fixed content box in the manner you've described, I was successful in doing it by saving the project and modifying the project file with a text editor to enforce a constant content box on every page. I used UltraEdit, which allows you to edit using regular expressions. You could probably do this with a script in Perl or some other language as well.

When you open the project again and read in the project file, Scan Tailor honors the manual content box you've set behind its back. You'll probably want to do the Right and Left pages in separate batches unless you've (magically? pre-processing with Photoshop or Jpegcrops?) managed to get more consistency between L and R views than most of us who use these DIY scanner designs. I think if you're careful not to move the book during scanning, you can get L and R images which are consistent enough to make a fixed content selection box workable. Editing the project file can be a quick and dirty way to see how valuable a fixed content selection box would be for your scans. You may just discover that it isn't the solution you hoped it would be, because your scans jitter or wander too much between the first page and the last.
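For anyone who would rather script the edit than do it in an editor, here is a minimal Python sketch of the same search-and-replace. It assumes the content box is stored in the project file as a self-closing `<content-rect .../>` element with `width`, `x`, `y` and `height` attributes; the function name `force_content_rect` and the numbers in the demo are made up for illustration. Work on a copy of the saved project, since this rewrites every page.

```python
import re

# Assumption: each page's content box is one self-closing <content-rect .../>
# element in the saved project file.
CONTENT_RECT = re.compile(r'<content-rect\b[^>]*/>')


def force_content_rect(project_text, width, x, y, height):
    """Return project_text with every content-rect replaced by one fixed box."""
    fixed = '<content-rect width="{}" x="{}" y="{}" height="{}"/>'.format(
        width, x, y, height)
    # Using a function as the replacement means the new text is inserted
    # literally, with no backslash or group-reference interpretation.
    return CONTENT_RECT.sub(lambda match: fixed, project_text)


if __name__ == "__main__":
    # Tiny stand-in for a project file; a real one has many more elements.
    sample = (
        '<content-rect width="2237.2" x="10.7" y="401.1" height="2831.4"/>\n'
        '<content-rect width="2190.0" x="35.2" y="388.0" height="2790.9"/>\n'
    )
    print(force_content_rect(sample, 2266.99, 114.85, 560.0, 2800.21))
```

To handle L and R pages separately, the simplest route is probably to keep them in separate projects and run the function once per project with different numbers.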


Post by lexicographer » 09 Jul 2010, 06:19

Great tip, spamsickle. The only failure I have had is in recognizing the correct height of the content (i.e. the lower margin); the left, right, and top borders are always identified perfectly by ST. So I don't even need to do separate left and right pages, since the height is the same.

Post by spamsickle » 09 Jul 2010, 09:53

If you're keying off the top which Scan Tailor found (adding a "height"), you'll be in scripting rather than editing territory, and it would still fail for title pages and things like the first page in a chapter (where the top is mostly white space). If you're just specifying a bottom value, I think you'd still have to worry about differences in your camera position between left and right -- although those folks (maybe you?) who have monitor displays with L and R images side by side could probably get those aligned before starting to shoot.
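The two variants described here (keep the detected top and force a height, or force a bottom edge) could be scripted roughly as follows. This is only a sketch: it assumes the `<content-rect .../>` element carries `y` and `height` attributes, and that y grows downward so that bottom = y + height. The function names are hypothetical. It inherits the caveat above: title pages and chapter openings, where the detected top is wrong, will still come out wrong.

```python
import re

# Assumption: one self-closing <content-rect .../> element per page.
CONTENT_RECT = re.compile(r'<content-rect\b[^>]*/>')
HEIGHT_ATTR = re.compile(r'\bheight="[^"]*"')
Y_ATTR = re.compile(r'\by="([^"]*)"')


def set_fixed_height(project_text, height):
    """Keep each page's detected x, y and width; overwrite only the height."""
    def fix_one(match):
        return HEIGHT_ATTR.sub('height="{}"'.format(height),
                               match.group(0), count=1)
    return CONTENT_RECT.sub(fix_one, project_text)


def set_fixed_bottom(project_text, bottom):
    """Keep each page's detected top; choose height so the box ends at bottom."""
    def fix_one(match):
        rect = match.group(0)
        y_match = Y_ATTR.search(rect)
        if y_match is None:
            return rect  # unexpected layout: leave this element untouched
        new_height = bottom - float(y_match.group(1))
        return HEIGHT_ATTR.sub('height="{}"'.format(new_height), rect, count=1)
    return CONTENT_RECT.sub(fix_one, project_text)
```

Only `content-rect` elements are rewritten; other elements that also have a `height` attribute are left alone.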

Anyway, good luck, and I hope you can make it work for you. Let us know if it doesn't, and what problems remain, and maybe someone will suggest a better solution.

Post by StevePoling » 09 Jul 2010, 13:09

It just occurred to me that maybe ScanTailor could try a different strategy for content selection. Instead of recognizing the page contents, perhaps it could recognize the off-page parts of the image. It's been many years since I did statistical pattern recognition, so I'll apologize up front if what follows is decades out of date.

Consider the structure of an image coming out of a DIY bookscanner. Can one statistically classify each of these page features: cradle, cover edge, paper edge, page margins (gutter, top, bottom & outer edge), header, body, footer?

Most pages of a book will be boringly similar. These "normal" pages' features will be recognized with high confidence. Odd pages (such as those with illustrations, chapter starts or ends) will be recognized with lower confidence. But since most books are typeset with consistent margins, the margins from normal pages can be suggested for odd pages.
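The last step (suggesting the normal pages' margins for odd pages) can be illustrated with a much simpler stand-in than real statistical classification: take the content boxes that were already detected, use the per-coordinate median as the "typical" box, and substitute it for any page that strays too far. This is not how Scan Tailor works and not a full version of the idea above; `suggest_rects`, the tuple layout and the tolerance are assumptions made for the sketch.

```python
from statistics import median


def suggest_rects(rects, tolerance):
    """rects: list of (x, y, width, height) tuples, one per page.

    Returns (typical, suggestions), where suggestions[i] is
    (rect_to_use, was_replaced) for page i.
    """
    if not rects:
        return None, []
    # Per-coordinate median: robust as long as odd pages are a minority.
    typical = tuple(median(r[i] for r in rects) for i in range(4))
    suggestions = []
    for r in rects:
        # A page is "odd" if any coordinate is far from the typical value.
        is_odd = any(abs(r[i] - typical[i]) > tolerance for i in range(4))
        suggestions.append((typical if is_odd else r, is_odd))
    return typical, suggestions


if __name__ == "__main__":
    pages = [
        (100, 500, 2000, 2800),
        (102, 498, 2001, 2805),
        (101, 1500, 1990, 900),   # e.g. a chapter opening, detected too low
        (99, 501, 2002, 2799),
    ]
    typical, suggestions = suggest_rects(pages, tolerance=50)
    for number, (rect, replaced) in enumerate(suggestions, start=1):
        print(number, rect, "suggested" if replaced else "kept")
```

Left and right pages would need to be fed in as separate lists, since their boxes sit at different offsets.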

Post by mellow-yellow » 12 Aug 2010, 02:23

Thanks Tulon and Spamsickle for your work! Building on Spamsickle's post above (http://diybookscanner.org/forum/viewtop ... =466#p4244), here are step-by-step instructions to effectively "override" the Select Content GUI option, thus manually setting each page's content area identically:

Overriding "Select Content":
    1. Open ST and set everything you need (probably applying to All Pages in most cases) through to #3.
    2. Click #4 Select Content, then Auto, then the Right ("Go" / "Process") arrow.
    3. Again in #4, manually select the content of an appropriate page (one whose content area is representative of the others).
    4. Click File, Save, then close the project.
    5. Load the project file in UltraEdit or Notepad++ and press CTRL+R (in UltraEdit, then click Regular Expressions: UltraEdit) or CTRL+H (in Notepad++, then click Regular Expression).
    6. Find:
       UltraEdit: <content-rect width="*" x="*" y="*" height="*"/>
       Notepad++: <content-rect width="[0-9].+
    7. Replace: <content-rect width="2266.992912969609" x="114.8497733994654" y="559.9999999999999" height="2800.20634481498"/> --- Obviously, use your numbers here! Hint: search for e.g. IMG_2172.JPG, find its id="333" and add 2 to it, then search for "335" and you'll find this Replace line listed in the select-content section.
    8. Save the file in UltraEdit / Notepad++.
    9. Open ST, click File, Open, and select your project.
After practicing, all 9 steps take only 2 minutes (max) for L and R combined.

If you need to select multiple lines for some reason, you can try something like this:
    1. Find: UltraEdit: [<]param*^n*^n*^n*^n*^n --- this selects a number of lines between the <param> tags
    2. Replace: (with one you like from a fully formed line, such as shown below)

Below is a sample Proj.scantailor file, indicating the start of the Select Content part of the file:

      <page id="4">
        <params mode="manual">
          <content-rect width="2266.992912969609" x="114.8497733994654" y="559.9999999999999" height="2800.20634481498"/>
          <content-size-mm width="143.9540499735702" height="177.8131028957513"/>
              <point x="165.7588007253573" y="0"/>
              <point x="2411.827783763773" y="93.15644600765081"/>
              <point x="2246.068983038416" y="4089.720472766398"/>
              <point x="0" y="3996.564026758747"/>
              <point x="165.7588007253573" y="0"/>
      <page id="7">
        <params mode="manual">
          <content-rect width="2237.273311897106" x="10.7266881028936" y="401.1832797427654" height="2831.494105037512"/>
          <content-size-mm width="142.0668553054662" height="179.799875669882"/>
              <point x="0" y="0"/>
              <point x="2248" y="0"/>
              <point x="2248" y="4000"/>
              <point x="0" y="4000"/>
              <point x="0" y="0"/>
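
The Find/Replace steps can also be done with a short script instead of an editor. The sketch below looks up the content box of one reference page (the page you selected manually) and copies it to every other page. It assumes the layout of the excerpt above, i.e. a `<page id="...">` element followed by a self-closing `<content-rect .../>`. The function names are hypothetical. Whether `content-size-mm` or the `mode` attribute also need updating is not something this sketch addresses, so try it on a copy of the project first.

```python
import re

# Assumption: one self-closing <content-rect .../> per page entry.
CONTENT_RECT = re.compile(r'<content-rect\b[^>]*/>')


def rect_of_page(project_text, page_id):
    """Return the <content-rect .../> text belonging to <page id="page_id">.

    The search stops at the next <page ...> element, so an entry for the same
    id in a section without a content-rect is skipped rather than matched.
    """
    pattern = re.compile(
        r'<page id="{}">(?:(?!<page\b).)*?(<content-rect\b[^>]*/>)'.format(
            re.escape(str(page_id))),
        re.DOTALL)
    match = pattern.search(project_text)
    return match.group(1) if match else None


def copy_rect_to_all(project_text, page_id):
    """Replace every page's content-rect with the one from page_id."""
    template = rect_of_page(project_text, page_id)
    if template is None:
        raise ValueError("no content-rect found for page id {}".format(page_id))
    return CONTENT_RECT.sub(lambda match: template, project_text)
```

Read the saved project as text, pass it through `copy_rect_to_all` with the id of your reference page, write the result to a new file, and open that file in ST.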
Post by n9yty » 12 Aug 2010, 10:24

Just a note for the Mac users who may be interested in trying this . . . Amongst probably many options is the freeware "TextWrangler" editor, the small sibling to the mighty BBEdit. It supports regex in search/replace via the "grep" checkbox in the search dialog.

Post by JonEP » 25 Aug 2010, 17:16

I too wish it were possible to "set fixed content area" on the "select content area" dialogue. If one could set a fixed content area, and then go back and manually adjust those pages that need fixing, it would be so much easier for me to use Scan Tailor.

Tulon, I know you have your own ideas about this, and your own life, I'm sure. I wish I were able to code--I'd contribute to the effort.

In any case, please count me in as someone who would appreciate a GUI addition to allow this feature.

NB. The main culprit that seems to require endless adjustment: chapters that start a third of the way down the page, or solitary section titles in the middle of a blank page... Very frustrating! (But thank you for your work.)

