
Software Requests (Besides Dewarping)?

StevePoling

Re: Software Requests (Besides Dewarping)?

Post by StevePoling » 14 Nov 2010, 18:43

Hey guys, we've talked about it in other threads, but what about page renumbering? You've got two cameras generating randomly named (but sequential) page image files. Generally, we put the images into left and right directories. It is really nice to have them renamed into something sequential that you can then feed to Scan Tailor.

This problem has been solved before with a bit of scripting here or there. I don't think we quite have a "canonical" solution that'll merge two directories' image files into an output directory with everything numbered exactly like you like it.
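A minimal sketch of what such a merge could look like in Python. It assumes the camera file names sort into capture order within each directory and that every left image has a matching right image; the function names (`plan_merge`, `merge_dirs`), the `.jpg` extension, and the four-digit numbering are all illustrative choices, not an existing tool.

```python
import shutil
from pathlib import Path


def plan_merge(left_names, right_names, ext=".jpg", width=4):
    """Pair sorted left/right names and assign sequential output names.

    Returns a list of (side, original_name, new_name) tuples.
    Raises ValueError if the two sides have different page counts.
    """
    left = sorted(left_names)
    right = sorted(right_names)
    if len(left) != len(right):
        raise ValueError(
            f"page count mismatch: {len(left)} left vs {len(right)} right"
        )
    plan = []
    for i, (l_name, r_name) in enumerate(zip(left, right)):
        plan.append(("left", l_name, f"{2 * i + 1:0{width}d}{ext}"))
        plan.append(("right", r_name, f"{2 * i + 2:0{width}d}{ext}"))
    return plan


def list_images(directory, ext=".jpg"):
    """File names in a directory whose extension matches (case-insensitive)."""
    return [p.name for p in Path(directory).iterdir() if p.suffix.lower() == ext]


def merge_dirs(left_dir, right_dir, out_dir, ext=".jpg"):
    """Copy (not move) both directories into out_dir with interleaved numbering."""
    dirs = {"left": Path(left_dir), "right": Path(right_dir)}
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    plan = plan_merge(list_images(left_dir, ext), list_images(right_dir, ext), ext)
    for side, name, new_name in plan:
        shutil.copy2(dirs[side] / name, out_dir / new_name)
    return plan
```

Copying rather than renaming in place keeps the separate left and right originals around, which matters as soon as a page has to be inserted or deleted.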

Sometimes when I'm scanning I find that I goof up and either double-clutch (imaging a pair of pages twice), or skip. After I discover this, I need to insert or delete some pages. And it's annoying to rename files afterwards.

Some tasks are inherently difficult, e.g. OCR; others are more tractable but annoyingly tedious. If you're looking for a target of opportunity, I think a definitive "file name renumberator" could be easy enough to pull off, and tricky enough to be interesting. I'm interested enough in doing something like this that I'll help, but I'm not going to bother if nobody else thinks it's worth it.

daniel_reetz

Re: Software Requests (Besides Dewarping)?

Post by daniel_reetz » 14 Nov 2010, 18:51

It's not quite what you want, but Matti wrote a renamer for our Instructable using glass:

http://www.instructables.com/id/Bargain ... ox/#step10

There's also Bulk Rename Utility:

http://www.bulkrenameutility.co.uk/

As well as the recently introduced File Wrangler:

http://diybookscanner.org/forum/viewtop ... =674#p6473

And Anonymous's own OCR Page Namer:

http://diybookscanner.org/forum/viewtop ... =674#p6335

I point these out not to discourage anyone, but rather to put in one place the many approaches to renaming that we have so far. A pre-processor for Scan Tailor could easily make this its first function.

univurshul

Re: Software Requests (Besides Dewarping)?

Post by univurshul » 14 Nov 2010, 19:27

...which is why I started a post about clocking cameras so that upon import, files are already sorted based on their capture time: http://www.diybookscanner.org/forum/vie ... ?f=3&t=627

Why make your CPU work any extra when sorting can be done in real time?
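For files that are already on disk, the same idea can be approximated after the fact. The sketch below orders the images from both directories by file modification time; it assumes the two camera clocks are synchronized and that the modification time survived the copy from the card. Reading the EXIF capture time would be more robust, but that needs a third-party library, so it is left out here. `sorted_by_capture_time` is a made-up name.

```python
from pathlib import Path


def sorted_by_capture_time(left_dir, right_dir):
    """Return all files from both directories, ordered by modification time.

    Ties (both cameras firing within the same timestamp) are broken by
    putting the left image first, then by file name.
    """
    entries = []
    for side, directory in enumerate((left_dir, right_dir)):
        for path in Path(directory).iterdir():
            if path.is_file():
                entries.append((path.stat().st_mtime, side, path.name, path))
    # Sort on (time, side, name) only; the Path itself is just carried along.
    entries.sort(key=lambda entry: entry[:3])
    return [entry[3] for entry in entries]
```

If the clocks drift apart by more than the time it takes to turn a page, this ordering breaks down, so it is worth checking the result against the left/right page counts.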

spamsickle

Re: Software Requests (Besides Dewarping)?

Post by spamsickle » 15 Nov 2010, 08:17

StevePoling wrote: This problem has been solved before with a bit of scripting here or there. I don't think we quite have a "canonical" solution that'll merge two directories' image files into an output directory with everything numbered exactly like you like it.
I guess that depends on what you mean by "exactly like you like it." For me, if the pages are in the same order they were in the book, that's exactly like I like it. I've seen some comments -- maybe from you, I don't remember -- which suggest that some people want page "iii" in the book named "iii", and page "43" named "43". I asked in that OCR page naming thread if there is a reader which recognizes a "page name" in this manner, and so far nobody's said yes. Unless there is such a reader, I don't see the point, for myself, of doing anything more than keeping all the pages with information in the proper sequence.
StevePoling wrote: Sometimes when I'm scanning I find that I goof up and either double-clutch (imaging a pair of pages twice), or skip. After I discover this, I need to insert or delete some pages. And it's annoying to rename files afterwards.
Yes, and this is why I think there may be a need for the kind of software you're proposing here. The batch renamers -- including the scripts I use myself -- work fine as long as there is a complete set of R and L pages. And really, with Scan Tailor and a "sequence is sufficient" attitude, pages which are shot twice are not a problem -- you just delete the duplicates from the ST project, and they disappear from the final product.

The real problem comes in when pages are missing. If you're missing a single image, say Right-42, your batch renamer has probably messed up the order for everything that comes after -- now Left-41 is followed by Right-44, which is followed by Left-43, etc. This can happen if one camera fails to fire for some reason, and it can be difficult to correct if you only notice it after a batch rename/merge and you haven't kept separate L and R originals.

If you're missing a pair of pages, because you turned two pages instead of one, you have to insert a pair of pages.

If your batch rename uses names like 0001L 0001R etc. and you've kept separate originals, inserting can be relatively painless -- just add the images to the L or R directory in the proper place (i.e., add 0043M between 0043L and 0044L in the "Left" directory), run your batch renamer making 0043M the new 0044L etc., and re-merge. This can still be a problem if you've already done Scan Tailor processing and saved the project, since Scan Tailor will keep information on each of the images by name, and you've now renamed them behind its back.
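A rough sketch of that insert-and-renumber step, assuming names of the form `0043L.jpg` that sort correctly as plain strings (so `0043M.jpg` lands between `0043L.jpg` and `0044L.jpg`). The helper names and the temporary `.renaming` suffix are invented for the example.

```python
import os
from pathlib import Path


def renumber_plan(names, suffix="L", start=1, width=4):
    """Map sorted names onto consecutive numbers; return only the changes.

    Each result is an (old_name, new_name) pair.
    """
    plan = []
    for offset, old in enumerate(sorted(names)):
        ext = os.path.splitext(old)[1]
        new = f"{start + offset:0{width}d}{suffix}{ext}"
        if new != old:
            plan.append((old, new))
    return plan


def apply_plan(directory, plan):
    """Rename in two passes so a new name never collides with a file
    that is still waiting to be renamed."""
    directory = Path(directory)
    staged = []
    for old, new in plan:
        temp = directory / (old + ".renaming")
        (directory / old).rename(temp)
        staged.append((temp, directory / new))
    for temp, final in staged:
        temp.rename(final)
```

For the example above, `renumber_plan(["0042L.jpg", "0043L.jpg", "0043M.jpg", "0044L.jpg"], start=42)` leaves the first two files alone, turns `0043M.jpg` into `0044L.jpg`, and moves the old `0044L.jpg` to `0045L.jpg`.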

In such cases, I think the proper thing to do is to add the missing image to the Scan Tailor project with a new name, without renaming images ST has already processed. Then, do your sequencing rename on the output, maybe even after the TIFF files have been converted to PDFs.

I like software that works all the time. I don't like to have to worry about "exceptions" or discover them after the fact. It's possible, by keeping left and right originals and employing a sensible renaming scheme, to handle deletions and insertions with a batch renamer. The problem is, it requires me to stop and think, when what I'd really like to do is say "Page 47 missing? I've got your page 47 right here..." and have the software take care of all the messy details behind the scenes. Those messy details may include a Scan Tailor project file keyed by name to a lot of processing which is already complete.

Hmmm, now that I think about it, "I've got your page 47 right here" does seem to make a case for naming page 47 something like 00047, whether the reader recognizes it or not... I still don't like an OCR renamer, because as I said, I want software that works every time, without requiring me to stop and think about exceptions. I have books in which page 46 is followed by a dozen unnumbered picture pages before page 47. I have art books in which almost none of the pages display a page number. An OCR renamer can keep them in order, but that implies I've already done one rename that PUT them in order.

I'm rambling, thinking in text here. I guess what I'm saying is, I think there is a need for something like you propose, but doing it right -- covering all the bases, so it always works and I never have to think beyond "insert this here" to use it -- may be more trouble than it's worth. And if you can only do it almost right, I'm not sure I wouldn't prefer to stay with a batch renamer (script, in my case) and a system I understand.

Anonymous1

Re: Software Requests (Besides Dewarping)?

Post by Anonymous1 » 15 Nov 2010, 11:54

Yeah, the batch renamers are really useful for complete collections of pages, but they fail when a single page is missing (I actually use that feature to figure out what pages I'm missing; when the page numbers != file numbers, I just work backwards to find the culprit). I've found Métamorphose (it's completely open-source, written in Python) to have the most complete feature set (the Beta is amazing, and not a single crash yet!): http://file-folder-ren.sourceforge.net/.

As for the Scan Tailor page-splitting issues, may I ask how ST does it? Before I discovered ST, I wrote a script with ImageMagick and Bash which basically graphs the colors of the image and finds the maximum (it's pretty cool how it looks: the text is a jagged line, then there is a smooth break, a pointed curve in the middle, and the jagged lines again). I never got around to implementing this fully in any program, but I'll post a sample graph.
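The original ImageMagick/Bash script isn't shown, so the following is only a guess at the idea in plain Python: average the intensity of each pixel column, then look for the extreme value near the middle of the image. It assumes a grayscale image already loaded as a list of rows; the loading step is left out. The names `column_profile` and `find_seam`, the default search window of one third of the width, and the `brightest` switch (the seam may show up as a bright or a dark peak depending on lighting) are all assumptions.

```python
def column_profile(pixels):
    """Mean intensity of each column.

    pixels is a list of equal-length rows of grayscale values.
    """
    height = len(pixels)
    return [sum(column) / height for column in zip(*pixels)]


def find_seam(profile, search_fraction=1 / 3, brightest=True):
    """Index of the extreme column within a window centered on the image.

    search_fraction is the share of the width to search; brightest picks
    the maximum of the profile, otherwise the minimum.
    """
    n = len(profile)
    half = max(1, int(n * search_fraction / 2))
    lo = max(0, n // 2 - half)
    hi = min(n, n // 2 + half + 1)
    pick = max if brightest else min
    return pick(range(lo, hi), key=lambda i: profile[i])
```

Restricting the search to the middle of the image keeps a bright margin or a dark page edge from being mistaken for the seam.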

Anonymous1

Re: Software Requests (Besides Dewarping)?

Post by Anonymous1 » 15 Nov 2010, 17:40

Okay, I've actually implemented the graphing into a quick Python script.

Here is the original image (thanks Google):
Image

Here is the intensity graph of that same image (normal):
Image

Here is the intensity graph of the image in bitonal:
Image

You can clearly see the interesting parts of the book from the graphs. I'm seeing if I can use this to detect text, seams, etc.

Anonymous1

Re: Software Requests (Besides Dewarping)?

Post by Anonymous1 » 15 Nov 2010, 17:57

Here's another sample. This one is a bit clearer, and I applied a Gaussian smoothing function to the data. It is really obvious now.
Image
Image

All that has to be done now is to take the derivative of that function and analyze it. It is quite fun!
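A small pure-Python sketch of those two steps, not the script used for the graphs above: Gaussian smoothing of a 1-D intensity profile, then a finite-difference derivative whose sign changes mark the peaks. The kernel radius of three sigma, the edge clamping, and the function names are choices made for the example.

```python
import math


def gaussian_kernel(sigma):
    """Normalized Gaussian weights covering about three sigma each side."""
    radius = max(1, int(3 * sigma))
    weights = [
        math.exp(-(x * x) / (2 * sigma * sigma))
        for x in range(-radius, radius + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(profile, sigma=2.0):
    """Convolve the profile with a Gaussian, repeating the edge values."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(profile)
    result = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp at the borders
            acc += weight * profile[j]
        result.append(acc)
    return result


def derivative(profile):
    """Central differences inside, one-sided differences at the ends."""
    n = len(profile)
    if n < 2:
        return [0.0] * n
    d = [profile[1] - profile[0]]
    for i in range(1, n - 1):
        d.append((profile[i + 1] - profile[i - 1]) / 2)
    d.append(profile[-1] - profile[-2])
    return d


def peaks(profile):
    """Indices where the derivative goes from positive to zero or negative."""
    d = derivative(profile)
    return [i for i in range(1, len(d)) if d[i - 1] > 0 and d[i] <= 0]
```

On real page images the smoothed profile will usually have several peaks, so some extra rule (for example, taking the one closest to the center) would still be needed to pick the seam.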

daniel_reetz

Re: Software Requests (Besides Dewarping)?

Post by daniel_reetz » 15 Nov 2010, 18:25

This is a really neat demo... but will it work on pages from our camera-based scanners? Can someone share some scans/data with Anonymous?

We really need a DIY Book Scanner dataset of page images for projects like this.

Anonymous1

Re: Software Requests (Besides Dewarping)?

Post by Anonymous1 » 16 Nov 2010, 01:15

I went as DIY as I possibly could; I took a picture of a book lying on my floor, cropped it, and centered the seam. The results are almost identical (here's a composite):
Image

I'm just going to see how I can extract that data from the graph mathematically, as I can easily do it visually...

Gerard

Re: Software Requests (Besides Dewarping)?

Post by Gerard » 16 Nov 2010, 05:57

Hi,

Could you post the last image without the red graph? I could also try a combination of filters to extract the data.
