Software Requests (Besides Dewarping)?

Anonymous1 · Post by **Anonymous1** » 12 Nov 2010, 01:55

I'm just wondering if anybody has anything they would like to have coded to aid in book scanning. I'm developing a toolkit for scanning books, and I would like to see what people would actually need (I scanned with a scanner, so I cheated) to make the process as pain-free as possible.

Thanks!

Gerard · Post by **Gerard** » 12 Nov 2010, 04:13

Vectorization
http://www.diybookscanner.org/forum/vie ... ?f=3&t=645
potrace works, but it does not cover some text features.
Every character is saved as a separate vector image. It would improve both file size and letter quality if the letters were analysed,

so that the shape of an "a" is saved only once in the file. I think the step from there to OCR would be small,
but OCR is a lot of work. To begin with, it would be enough for the software to compare shapes: more than one representation of an "a" could be saved as a vector image, and the software does not need to know whether a shape is a letter.

On another vectorization topic: letters are mostly built from straight or curved lines, and the letter shapes could be improved if the algorithm took this knowledge more into account.
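
A minimal sketch of the shape-comparison idea above, assuming each glyph has already been cut out as a small binary bitmap. The names `hamming`, `build_shape_dictionary` and the `max_diff` threshold are hypothetical, and pixel-wise Hamming distance is only one simple way to compare shapes; it is not what potrace or any other tool mentioned here does.

```python
from typing import List, Tuple

Glyph = Tuple[Tuple[int, ...], ...]  # rows of 0/1 pixels


def hamming(a: Glyph, b: Glyph) -> int:
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def same_size(a: Glyph, b: Glyph) -> bool:
    return len(a) == len(b) and len(a[0]) == len(b[0])


def build_shape_dictionary(glyphs: List[Glyph], max_diff: int) -> Tuple[List[Glyph], List[int]]:
    """Store each distinct shape once and refer to it by index.

    A glyph reuses the first stored prototype of the same size that differs
    by at most max_diff pixels; otherwise it becomes a new prototype. The
    code never needs to know which letter (if any) a shape represents.
    """
    prototypes: List[Glyph] = []
    indices: List[int] = []
    for glyph in glyphs:
        for i, proto in enumerate(prototypes):
            if same_size(proto, glyph) and hamming(proto, glyph) <= max_diff:
                indices.append(i)
                break
        else:
            prototypes.append(glyph)
            indices.append(len(prototypes) - 1)
    return prototypes, indices


# Two slightly different scans of the same shape and one different shape.
a1 = ((0, 1, 0), (1, 1, 1), (1, 0, 1))
a2 = ((0, 1, 0), (1, 1, 1), (1, 0, 0))
b = ((1, 1, 0), (1, 0, 0), (1, 1, 0))
prototypes, indices = build_shape_dictionary([a1, b, a2], max_diff=1)
```

Only the prototypes would then need to be vectorized; every other occurrence is just an index plus a position on the page.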

Anonymous1 · Post by **Anonymous1** » 12 Nov 2010, 09:56

You could just have the user select samples of each letter, right? Or a partially automated thing where the software asks the user if the letter that is detected is what is actually on the page? This seems like a great idea!

Post by **daniel_reetz** » 12 Nov 2010, 12:04

As you might have noticed, some people have problems getting Scan Tailor to work with their scanners. The main issue seems to be that their scans are different than what Scan Tailor is designed to expect. Univurshul has gone through and solved this problem with Adobe Lightroom, but what we could use generally is a "pre-processor" for Scan Tailor that would make the photo-scans closer to Scan Tailor's expectations.

This would be a serious problem-solver for the whole community, and it might be as simple as a program that does some cropping and format conversion/page renaming.

A more in-depth discussion of the problem is here, including sample images:

http://www.diybookscanner.org/forum/vie ... 3&start=10

It can be purely manual.
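
As a rough illustration of the page-renaming part of such a pre-processor, here is a sketch that only plans the renames and touches no files. The `page_0001` naming scheme and the function names are assumptions made for illustration, not something Scan Tailor is known to require; cropping and format conversion would need an imaging library and are left out.

```python
import os
import re
from typing import List, Tuple


def natural_key(name: str) -> list:
    """Sort key that puts IMG_2 before IMG_10."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def plan_renames(names: List[str], prefix: str = "page_", width: int = 4) -> List[Tuple[str, str]]:
    """Return (old_name, new_name) pairs with zero-padded sequential numbers.

    The original extension is kept (lowercased), because renaming a file
    does not convert its format.
    """
    ordered = sorted(names, key=natural_key)
    plan = []
    for number, old in enumerate(ordered, start=1):
        ext = os.path.splitext(old)[1].lower()
        plan.append((old, f"{prefix}{number:0{width}d}{ext}"))
    return plan


plan = plan_renames(["IMG_10.JPG", "IMG_2.JPG", "IMG_1.JPG"])
```

Each pair in `plan` could then be passed to `os.rename` once the user has checked the order.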

I will be happy to help test things, promote your software on the blog and recommend it around here, as I do with all the awesome homebrew work we have going on here.

I think vectorization is a good suggestion, but it sounds like a career's worth of work and a new file format, so maybe you should focus your efforts on easier stuff first... also clearscan, djvu, and potrace all do vectorization...

univurshul · Post by **univurshul** » 12 Nov 2010, 15:23

Anonymous,

As you process more and more books in Scan Tailor, you'll routinely notice that ST's auto-content selection box crops away the page number quite frequently (for various reasons). You of all people may have an interest in creating a fixed content selection box within the ST environment to circumvent this issue; it may help your script and provide a big feature many are looking for.

I'll look for this thread later today and link to it.

Gerard · Post by **Gerard** » 12 Nov 2010, 16:41

It would be nice if the processing chain (cropping, filtering and so on) could use plugins. A plugin could just be a command line with parameter substitution;
this would allow other command-line programs to be used from your software.

Just give the user a text area and some variables,
e.g.

"imageapp -crop $current_image -size 21230 $outputimage"

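A minimal sketch of how such a command template could be expanded, using Python's standard `string.Template` and `shlex`. The function name `build_command` and the file paths are hypothetical, and `imageapp` is just the placeholder program from the example above.

```python
import shlex
from string import Template
from typing import Dict, List


def build_command(template: str, variables: Dict[str, str]) -> List[str]:
    """Expand $variables in a user-supplied command template.

    The template is split into arguments first and substituted second,
    so a file name containing spaces stays a single argument.
    A variable missing from `variables` raises KeyError.
    """
    return [Template(token).substitute(variables) for token in shlex.split(template)]


command = build_command(
    "imageapp -crop $current_image -size 21230 $outputimage",
    {"current_image": "scans/page 001.jpg", "outputimage": "out/page_001.tif"},
)
```

The resulting list can be handed to `subprocess.run(command)` without going through a shell.
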
Post by **daniel_reetz** » 12 Nov 2010, 17:16

univurshul wrote: Anonymous,

As you process more and more books in Scan Tailor, you'll routinely notice that ST's auto-content selection box crops away the page number quite frequently (for various reasons). You of all people may have an interest in creating a fixed content selection box within the ST environment to circumvent this issue; it may help your script and provide a big feature many are looking for.

I'll look for this thread later today and link to it.

a VERY astute answer; this is also the most requested feature WRT Scan Tailor.

lexicographer · Post by **lexicographer** » 14 Nov 2010, 13:36

If I understand the fixed content selection box correctly, it works on the assumption that all scanned pages have the content in the same spot. This would not work in my case, since my books get moved around when turning the pages (sheer lack of attention; when I photograph a couple of hundred pages, my mind usually wanders - and ST corrects all that). A fixed content selection box would in my case probably cut off some content on many pages. What I would like instead is the possibility to set a margin, so that the content selection would be bigger than the content detected by ST. This would e.g. ensure that line numbers, which are normally a set distance from the main text (and which sometimes get cut off by ST when they are printed faintly), would be selected if they fall within the margin (I have a lot of texts with line numbers). As a consequence the output pages would be wider, but that could easily be corrected in postproduction, since the content would be centered.
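
The margin idea can be sketched as a small box calculation. This is only an illustration: the `(left, top, right, bottom)` box layout and the name `expand_box` are assumptions, not Scan Tailor's internal representation.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom, in pixels


def expand_box(content: Box, margin: int, page: Box) -> Box:
    """Grow the detected content box by `margin` on every side,
    without letting it extend beyond the page itself."""
    left, top, right, bottom = content
    page_left, page_top, page_right, page_bottom = page
    return (
        max(page_left, left - margin),
        max(page_top, top - margin),
        min(page_right, right + margin),
        min(page_bottom, bottom + margin),
    )


page = (0, 0, 1000, 1500)
wider = expand_box((100, 200, 900, 1400), 50, page)
clamped = expand_box((20, 200, 900, 1400), 50, page)
```

In the second call the left edge is clamped to the page, so the content is no longer exactly centered on that page.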

spamsickle · Post by **spamsickle** » 14 Nov 2010, 14:21

I don't think "more margin" is going to be the solution in that case either, unless ST always cuts off page numbers. If it doesn't, adding margin to a page on which ST included the page number will probably get you bits of the opposing page, or slivers of the cradle.

I suppose it might still be useful if ST usually cut off page numbers -- then you could globally apply the "more margin" setting, and have fewer images which required manual correction.

My own inclination is to implement something like JPEGcrops uses -- a selection box which could be applied to a range (& optionally, every other element of a range), with a fixed size and position. Then, in addition, offer the ability to drag a selection box as a unit, rather than just adjusting edges or corners. Even when your book moves around as you scan, the size of the pages in the image will stay the same. If a box was already the right size for your content selection, nudging it into place could be faster than dragging a couple of edges to include content that had been lopped off. Even if it wasn't faster (i.e., you could simply drag a corner to pick up the page number), it should still provide better consistency from one page to the next than a set of individual manual adjustments. And, if you only moved the book every once in a while as you scanned, you could benefit from the ability to apply a manual adjustment to all the images shot between moves.
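
A rough sketch of the two operations described above: applying one fixed box to a range of pages (optionally every other page) and moving a box as a unit. The function names and the `(left, top, width, height)` layout are hypothetical and are not taken from JPEGcrops or Scan Tailor.

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # left, top, width, height, in pixels


def apply_fixed_box(boxes: Dict[int, Box], box: Box, first: int, last: int,
                    every_other: bool = False) -> None:
    """Assign the same box to pages first..last (inclusive).
    With every_other=True only first, first+2, ... are changed,
    e.g. all left-hand pages."""
    step = 2 if every_other else 1
    for page in range(first, last + 1, step):
        boxes[page] = box


def nudge(box: Box, dx: int, dy: int, page_w: int, page_h: int) -> Box:
    """Move the box as a unit, keeping its size and keeping it on the page."""
    left, top, width, height = box
    new_left = max(0, min(left + dx, page_w - width))
    new_top = max(0, min(top + dy, page_h - height))
    return (new_left, new_top, width, height)


boxes: Dict[int, Box] = {}
apply_fixed_box(boxes, (10, 20, 800, 1200), first=1, last=6, every_other=True)
moved = nudge(boxes[1], dx=300, dy=-50, page_w=1000, page_h=1500)
```

Because the size never changes, the box stays consistent from page to page even after it has been nudged.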

I actually see the manual setting as a way to save overall processing time. Rather than requiring ST to do all the number crunching it does to identify content, just apply a couple of manual settings to all the pages and run with that. "More margin" is going to require ST to do all that number crunching anyway, which is typically the second most expensive step ("output" being the most expensive). My hope is that, in most cases, a manual content selection option will allow me to bypass automatic content selection altogether.

A secondary goal, for me, is to preserve the formatting of the original book. I expect my own "content" selections will be more or less the entire page. That way, "dedications" will fall naturally in the upper third of the page, title pages will be formatted as they were in the original, chapter beginnings will keep white space at the top of the page, while chapter endings will keep white space at the bottom. Others could still, of course, select content as they preferred.

lexicographer · Post by **lexicographer** » 14 Nov 2010, 14:53

I see your point about getting too much if the content margin is set too big. A fixed-size content box which could be moved around with the cursor would certainly be convenient in some cases. However, what attracted me to ST in the first place was precisely the automatic selection (which - notwithstanding my complaints - works very well on 99% of the pages) and in my workflow saves a lot of time. I worked with manual splits etc. before I found ST (in IrfanView), which basically meant I had to check every page anyway, and deskewing was only done by FineReader at the OCR stage (and not very efficiently compared to ST).

A post by tulon some time ago made me aware of the possibility to order pages by height and width (I just had never looked at the lower right corner), and that has helped a lot with discovering the pages where the content selection is too big (usually quite spectacularly, and easy to see on the thumbnails). The same helps with cut-off headers (page numbers) and marginal line numbers (but not as efficiently, since they can be difficult to see on the thumbnails, especially if they are so faint as to be missed by ST). You are of course right that selection is an expensive step, but it runs in the background quite peacefully. I am ashamed to say that it was only yesterday that I discovered that one can click on menu number 4 (content selection) directly after loading the project, and ST will go happily through all the intermediate stages unattended (any mistakes in the splitting can be caught by ordering by width, and so far I have never had a mistake in deskewing).
Anyway, this has become quite a long post, so thanks for reading it (if you get that far).
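
The "order by width and height" check can also be approximated outside ST. The sketch below flags pages whose content size is far from the median; the function name, the 10% tolerance and the sample sizes are all assumptions for illustration, not a feature of Scan Tailor.

```python
from statistics import median
from typing import Dict, List, Tuple


def flag_size_outliers(sizes: Dict[str, Tuple[int, int]], tolerance: float = 0.1) -> List[str]:
    """Return the names of pages whose content width or height differs
    from the median by more than `tolerance` (as a fraction of the median)."""
    med_w = median(w for w, _ in sizes.values())
    med_h = median(h for _, h in sizes.values())
    return sorted(
        name for name, (w, h) in sizes.items()
        if abs(w - med_w) > tolerance * med_w or abs(h - med_h) > tolerance * med_h
    )


# Hypothetical content sizes (width, height) in pixels.
sizes = {
    "p1": (800, 1200),
    "p2": (810, 1190),
    "p3": (1500, 1210),  # selection far too wide
    "p4": (795, 1000),   # header probably cut off
    "p5": (805, 1205),
}
suspects = flag_size_outliers(sizes)
```

Faint line numbers that ST misses entirely would change the size only slightly, so a check like this would likely overlook them just as the thumbnails do.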
