Training, by W. G. George, 1902

rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: Training, by W. G. George, 1902

Post by rob »

OK, so using the text block measurements, I get page 20 = 462 dpi, page 120 = 486 dpi, page 121 = 450 dpi. That's reasonably close to what I measured with the page sizes, but these numbers should be more accurate.

I can't find page 21 -- it should be between right/DSC00365 (page 19) and right/DSC00366 (page 23), shouldn't it? Did you miss a page?

1.5 pages is the page split -- there's 1 page and 2 pages, and in between is what I call 1.5 pages :)

Anyway, except for the page split, which is easy enough to set manually for all pages, I typically let Scan Tailor (ST) run in automatic mode for just one stage, and then scroll through the thumbnails to correct any obvious errors. You can instead let it run automatically through all stages (except output) and then correct the errors, but then you have to know which stage a particular problem came from. This is why I like to run just one stage at a time: deskew, correct errors, content, correct errors, margins, correct errors, output.

As for what ST does with the dpi information... it sort of matters. In pretty much every phase, the only thing you really care about is that the pages are physically sized the same, but since images are pixels, you have to provide the translation between pixels and inches; hence, dpi. You could, in theory, provide ST any dpi as long as the pages come out to the same physical measurement. But since you have to measure each page against something, it's just as easy to measure the correct size and work out the dpi from there. Another method is to assume a particular line spacing measurement, and use that.
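
The pixels-to-inches translation described above is a single division. Here is a minimal sketch of that arithmetic; the function name and the measurement values are made up for illustration and are not taken from Scan Tailor or from the actual scans.

```python
def dpi_from_measurement(pixels: float, inches: float) -> float:
    """Resolution implied by a feature that spans `pixels` in the image
    and `inches` on the physical page."""
    if inches <= 0:
        raise ValueError("physical size must be positive")
    return pixels / inches


# Hypothetical numbers: a text block measured at 3.0 inches with a ruler
# that spans 1386 pixels in the photograph.
print(dpi_from_measurement(1386, 3.0))  # 462.0
```

Measuring a wide feature such as the whole text block, rather than a single line, keeps the ruler error small relative to the measurement.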

In the output phase, though, you can choose to output in 600 dpi, which means that if you assumed a fake dpi of 10, your images will be way too large, and if you assumed a fake dpi of 2000, your images will be way too small. Which is why you want to use the real dpi.
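
To see why a fake dpi goes wrong at output time, here is a rough sketch of the scaling the paragraph above implies. It is only the arithmetic, not Scan Tailor's actual code, and the 2000-pixel page width is a made-up value.

```python
def output_pixels(input_pixels: int, assumed_dpi: float, output_dpi: float = 600) -> int:
    """Pixels needed at `output_dpi` to keep the physical size that
    `input_pixels` at `assumed_dpi` implies."""
    physical_inches = input_pixels / assumed_dpi
    return round(physical_inches * output_dpi)


# A hypothetical page image 2000 pixels wide, under three assumed resolutions.
for assumed in (10, 462, 2000):
    print(assumed, output_pixels(2000, assumed))
# 10 120000   (way too large)
# 462 2597
# 2000 600    (way too small)
```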
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.

Post by rob »

Also, this is relevant to this book.

Post by rob »

I took a closer look at the page sizes on the left side, measuring the text width. Here's what I got:
[Attachment: graph.png, a graph of the measured text width on the left-side pages]
So clearly something's going on. I expect some variation, but not that much, unless your scanner isn't fixed platen. If you have a fixed distance between the camera and the platen, then the page should be the same distance from the camera every time. If, however, your platen isn't fixed, then the distance between the page and the camera will change -- the page will appear slightly larger when it gets closer -- and it's possible that's what we're seeing here.

The other alternative is that the pages really aren't uniform.

In any case, there's not a whole lot you can do to get the images aligned perfectly. Approximate is the best you can do.
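
The effect of a moving platen can be estimated with a simple pinhole-camera approximation, in which apparent size (and so effective dpi) is inversely proportional to the distance between camera and page. This is a sketch under that assumption; the distances and the reference dpi are hypothetical, not measurements of this scanner.

```python
def dpi_at_distance(reference_dpi: float, reference_mm: float, distance_mm: float) -> float:
    """Effective dpi when the page sits `distance_mm` from the camera,
    given a known dpi at `reference_mm` (pinhole approximation)."""
    if reference_mm <= 0 or distance_mm <= 0:
        raise ValueError("distances must be positive")
    return reference_dpi * reference_mm / distance_mm


# Hypothetical setup: 462 dpi with the page 500 mm from the camera.
# If the page ends up 25 mm closer, it appears larger:
print(round(dpi_at_distance(462.0, 500.0, 475.0), 1))  # 486.3
```

With a fixed platen the distance term never changes, which is why one measurement per side is enough in that case.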
vitorio
Posts: 138
Joined: 30 Oct 2010, 23:56
Number of books owned: 0
Location: Austin, Texas, USA

Re: Training, by W. G. George, 1902

Post by vitorio »

rob wrote:I can't find page 21 -- it should be between right/DSC00365 (page 19) and right/DSC00366 (page 23), shouldn't it? Did you miss a page?
Hah! It appears so! I'll retake that page tonight.
rob wrote:So clearly something's going on. I expect some variation, but not that much, unless your scanner isn't fixed platen. If you have a fixed distance between the camera and the platen, then the page should be the same distance from the camera every time. If, however, your platen isn't fixed, then the distance between the page and the camera will change -- the page will appear slightly larger when it gets closer -- and it's possible that's what we're seeing here.

The other alternative is that the pages really aren't uniform.

In any case, there's not a whole lot you can do to get the images aligned perfectly. Approximate is the best you can do.
Wow, that's a fascinating graph.

So, the distance to the page would be changing slightly, as I turn the page and it removes or adds thickness, so eventually it will be different by the entire width of the book, right? But maybe that shouldn't be a thousand pixels of resolution difference.

Also, the book is slightly off, internally. Some of the pages appear to have been printed in different faces, or scaled from different sizes, I assume to fit within the 13.9cm by 7.7cm internal width.

Does this mean the only way to do it completely accurately is to physically measure each page?

Perhaps there should be a machine-readable ruler on the platen…

Post by rob »

vitorio wrote:So, the distance to the page would be changing slightly, as I turn the page and it removes or adds thickness, so eventually it will be different by the entire width of the book, right? But maybe that shouldn't be a thousand pixels of resolution difference.
Well, maybe not a thousand pixels, but certainly tens of pixels. Note that the graph includes a jump corresponding to the jump in resolution I noticed earlier.
vitorio wrote:Does this mean the only way to do it completely accurately is to physically measure each page?
Well, that depends. If your goal is to produce a file that is better than the book, then yes, you have to physically measure the text block on each page to ensure that the text block is the same size on every page. This assumes that the book has nonuniform text block sizes.

If your goal is to produce a file that reproduces the book, then you need a fixed platen scanner and a measurement of one page on the left and one on the right. If you have a moving platen scanner, then you'll have to measure it every ten pages or so and adjust the dpi for that group of pages to get something closely resembling the actual resolution.
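
The "measure every ten pages or so" approach amounts to a lookup: each page takes the dpi of the most recent measured page. A minimal sketch follows, assuming pages are numbered from 0 in scan order; the function name and the measured values are hypothetical.

```python
from bisect import bisect_right


def dpi_for_pages(measured: dict[int, float], page_count: int) -> list[float]:
    """Give every page the dpi of the nearest measured page at or before it.
    Pages before the first measurement reuse the first measurement."""
    if not measured:
        raise ValueError("need at least one measurement")
    sample_pages = sorted(measured)
    result = []
    for page in range(page_count):
        # Index of the last sampled page that is <= this page.
        i = bisect_right(sample_pages, page) - 1
        result.append(measured[sample_pages[max(i, 0)]])
    return result


# Hypothetical measurements taken at pages 0, 10 and 20 of a 25-page run.
dpis = dpi_for_pages({0: 462.0, 10: 470.0, 20: 486.0}, 25)
print(dpis[9], dpis[10], dpis[24])  # 462.0 470.0 486.0
```

Interpolating between measurements would be smoother, but a step per group matches the manual workflow of setting one dpi for a selected range of pages.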

I'm still working on your images, and I think I'll have something close to acceptable.

Post by rob »

Here's what I got: https://spideroak.com/browse/share/Romeo_and_Sushi/Book

In the margins stage, I set the margins to 0.1 mm on each side, applied that to all pages, and then went to the thumbnails and ordered by height. I went to the last one, which is the tallest, and found that it was an unusual size, so I unchecked the same-size-page option for that one.

Then I manually adjusted the left/right margins on all the pages, setting the page to align left or right. I probably didn't have to do that, but I wanted the result looking nice.

Output in color, and there's the result.

Post by vitorio »

Thanks for walking me through your process, it's been super informative.

Assembling it in color really makes clear the occasional poorly focused page, which I didn't notice as much when going through Scan Tailor zoomed out. Even with Scan Tailor's magic, though, you can still see the problems with the scanner itself around the edges, and I think I'll have to go with black-and-white text output for this book.

Looking forward to spending some more time with the originals.

I guess once I'm done, I'll run it through OCR and then something like gutcheck. Are there commercial OCR correction services? Seems like Distributed Proofreaders has a big backlog and only does it to submit to Project Gutenberg.

Post by rob »

I think I could use this in some kind of video tutorial for Scan Tailor. There are a lot of interesting issues with this book that I think people can learn from. Can you get me that missing page?

OK, so for black and white output (actually mixed, because I think you want the pictures not to be binarized), it's not so important to fiddle with the margins. I mean, it is, but not in the way that you have to if you want a faithful color copy. It's a lot easier, though, since you don't have to deal with the edges of pages or the ugly spine showing up in your images.

Start by setting the margin for every page to some uniform size, say 13mm all around (about half an inch), with alignment to the top middle. The reason for that is you want every page's text block to align at the top, and be centered on the page horizontally.

Then you need to fix up the fat pages. Remember that Scan Tailor will find the largest content block, add the margins, and use that as the uniform page size. If there's a single page that has a large content block, all your pages will be at least that large (plus the margins). So finding those pages and reducing the margins for those pages is important if you want a uniform page size. The alternative is to find the fat pages and just remove the check from "make same page size", but then you'll end up with the occasional page that's larger than the rest of the pages. If that's ok, do that.

Using the same control at the bottom of the thumbnail list in the margin phase, order by height, find the tallest page, make the margins smaller for that page (or check off "same page size"), and keep doing that until you start hitting your "normal" pages. Do the same with width.
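
The "fat page" procedure above can be modeled in a few lines. This is a simplified sketch of the behavior as described in this post (largest content block plus margins sets the page size), not Scan Tailor's real code; the `Page` class, the page names, and the sizes in millimeters are all made up.

```python
from dataclasses import dataclass


@dataclass
class Page:
    name: str
    content_w: float          # content block width, mm
    content_h: float          # content block height, mm
    margin: float = 13.0      # margin on every side, mm
    same_size: bool = True    # takes part in the uniform page size


def uniform_page_size(pages):
    """Largest content-plus-margins box among pages that share the size."""
    sized = [p for p in pages if p.same_size]
    width = max(p.content_w + 2 * p.margin for p in sized)
    height = max(p.content_h + 2 * p.margin for p in sized)
    return width, height


pages = [
    Page("p020", 77.0, 139.0),
    Page("p021", 77.0, 138.0),
    Page("plate", 80.0, 160.0),  # one oversized page inflates everything
]
print(uniform_page_size(pages))  # (106.0, 186.0)

# Order by height, take the tallest, and exclude it from the uniform size.
tallest = max(pages, key=lambda p: p.content_h)
tallest.same_size = False
print(uniform_page_size(pages))  # (103.0, 165.0)
```

Reducing that page's margin instead of excluding it would be the other option mentioned above; either way the loop is the same: sort, fix the largest, repeat until the normal pages are reached.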

Finally, order naturally, and scroll through to find things like chapter starts, charts, and so on, which shouldn't be aligned to the top, and fix the alignment. Find the blank pages and set them as center aligned.

And you're done! Any remaining jiggling in the page numbers would be due to either genuine artifacts in the book, or that non-fixed platen thing I mentioned earlier which results in nonuniform resolution.

Finally, package the resulting TIFFs in whatever form you like: djvubind, pdfbeads, whatever. I personally use Adobe Acrobat with ClearScan turned on, which compresses the file immensely and also makes it look much better. The OCR isn't that bad, either.

Post by rob »

vitorio wrote:I guess once I'm done, I'll run it through OCR and then something like gutcheck. Are there commercial OCR correction services? Seems like Distributed Proofreaders has a big backlog and only does it to submit to Project Gutenberg.
Honestly, for a book as small as this, you might just want to do it yourself, considering the investment in time you've already put into it. That being said, I don't know offhand of any OCR correction service. Googling "OCR correction service" throws up a whole bunch of services, but I haven't had any experience with any of them.

Post by vitorio »

That last link is a .zip file which includes the missing page 21 (named 366-1 so it should sort between 365 and 366 automatically), plus a few other pages that looked egregiously blurry or out-of-focus, named for the files they should replace.
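
Whether a name like 366-1 really sorts between 365 and 366 depends on the sort used. The sketch below assumes full names of the form DSC00366-1.jpg (the prefix appears earlier in the thread, but the .jpg extension is a guess) and a plain character-by-character sort. Tools that use "natural" ordering may behave differently.

```python
# Assumed file names; only the numbers come from the thread.
names = ["DSC00366.jpg", "DSC00365.jpg", "DSC00366-1.jpg"]

# '-' (code 45) sorts before '.' (code 46), so the inserted page
# lands between 365 and 366 when the extension is part of the name.
print(sorted(names))
# ['DSC00365.jpg', 'DSC00366-1.jpg', 'DSC00366.jpg']

# Without an extension the result flips: a name sorts before any
# longer name that starts with it.
print(sorted(["DSC00366", "DSC00365", "DSC00366-1"]))
# ['DSC00365', 'DSC00366', 'DSC00366-1']
```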

These photos were taken the same way as the others: 1.9x zoom, centered over the book with the focus targets on either side, set a white balance, take the shots.

Users of these files should crop and set the DPI on them before replacing the originals, as they now constitute the fourth separate scanning session. All the other pages should be figure-out-able, but for the one with the table, the top line of the table is 5 and 3/16ths of an inch wide.

Now to work them over myself!