Basic Guide to Workflow ?

Don't know where to start, or stuck on a certain problem? Drop by and tell us about it. Feel like helping others? Start here.

Moderator: peterZ

strider1551
Posts: 126
Joined: 01 Mar 2010, 11:39
Number of books owned: 0
Location: Ohio, USA

Re: Basic Guide to Workflow ?

Post by strider1551 »

univurshul wrote:I did wonder about compression now that you mention it: what apps are you using to compress your TIFFs post-Scan Tailor?
Depends on the image, which is partly why I wrote djvubind. The djvulibre tool for black-and-white images is cjb2, which uses pattern matching and can be set to either lossless compression or a lossy method that will allow small changes to create more matches. I use minidjvu, which from what I know does the same thing as cjb2 but uses a shared dictionary of matches for several pages (djvubind is set to 100 pages per dict, I think). An image with more colors gets separated, the black-and-white portions being sent to cjb2 before being merged back to the original with csepdjvu. I actually have a lot more to learn about the different compression methods and what might work even better, but there are some more pressing areas of development for djvubind before I start fine-tuning things that already work very well.
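
To make the lossless/lossy choice above a bit more concrete, here is a rough Python sketch that only builds a cjb2 command line and does not run it. The -dpi and -lossy options come from the cjb2 man page in DjVuLibre (lossless is the default there); the helper name cjb2_command and the file names are made up for illustration, and this is not necessarily how djvubind calls the tool.

```python
def cjb2_command(src, dst, dpi=600, lossy=False):
    """Build (but do not run) a cjb2 command line for one black-and-white page.

    src and dst are hypothetical file names; cjb2 is lossless unless
    -lossy is given.
    """
    cmd = ["cjb2", "-dpi", str(dpi)]
    if lossy:
        cmd.append("-lossy")  # allow small changes so more shapes match
    cmd += [src, dst]
    return cmd


# Print the command instead of running it, so this works without DjVuLibre.
print(" ".join(cjb2_command("page_0001.tif", "page_0001.djvu", lossy=True)))
```

To actually encode a page, the returned list could be handed to subprocess.run on a machine with DjVuLibre installed.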

I work with the djvu format mainly because the opensource toolset for it is quite capable (that or I know more about it and it just seems more capable!). I guess Misty is working on a pdf maker script, and I'm really looking forward to seeing direct comparisons of djvu and pdf files produced from the same set of images.
univurshul wrote:What would you recommend for iPad displays?
I'm not sure what your concern is, but I probably can't answer the question. Things like power consumption and render speeds would depend on the method of compression used in the image (and generally it is a trade off between power, speed, and size). Audio and video codecs are compared all the time for those things, but I haven't looked for a comparison of image compression and I doubt there would be a significant difference. Image quality would only be degraded with lossy compression. So long as it is a lossy compression designed for images of text, there shouldn't be a difference easily noticeable to the human eye (jpg compression, for example, is made for photo images and should never be used on a text image like we work with here).
univurshul wrote:Looking forward to djvubind once ported to the Mac OS.
I recently learned that minidjvu doesn't build on Mac. I made minidjvu a dependency because I've seen it cut file sizes almost in half on files that were already impressively small. I have two major things I still want to get into djvubind: cuneiform ocr engine (essentially done), and multicore improvements. Once those are out of the way, I plan to make minidjvu used only if present and cjb2 otherwise, and then post something here looking for people with macs to tell me if the code even runs, let alone how to package it for easy installation.


...and a little closer to the original topic, I would encourage you, recaptcha, that there is no need to build a book scanner before trying out the software side firsthand. Take a quick picture of a book held open with your hand or a piece of paper or something; forget about getting the angle right and eliminating page curvature - that's half of what the book scanner is for. Toss the image on your computer, see how Scan Tailor works (and the things you can't expect it to fix!), try different formats like pdf, djvu, etc.
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States
Contact:

Re: Basic Guide to Workflow ?

Post by daniel_reetz »

[quote=strider1551]
...and a little closer to the original topic, I would encourage you, recaptcha, that there is no need to build a book scanner before trying out the software side firsthand. Take a quick picture of a book held open with your hand or a piece of paper or something; forget about getting the angle right and eliminating page curvature - that's half of what the book scanner is for. Toss the image on your computer, see how Scan Tailor works (and the things you can't expect it to fix!), try different formats like pdf, djvu, etc.
[/quote]

Some of the best advice ever given on this forum!
recaptcha
Posts: 64
Joined: 03 Sep 2010, 13:23
Number of books owned: 0
Location: Calgary, Alberta, Canada

Re: Basic Guide to Workflow ?

Post by recaptcha »

univurshul wrote:
recaptcha wrote:...I'm on a Mac with a bootcamp partition, so I guess I could go either way (Windows or Mac). I'm willing to use whatever software makes things faster and easier e.g. Abbyy FineReader, OmniPage, or any of the open source ones mentioned around here. (I read somewhere that you can scan the page image right into Abbyy FineReader, thus eliminating at least one importing step).
When you have a book scanner, and if it's a dual camera rig, I have a Mac instructional that will help you import & prepare your images for Scan Tailor: http://www.diybookscanner.org/forum/vie ... ?f=3&t=527

Scan Tailor is superb for processing camera images. But if you're simply working with a flat bed scanner for now, then dedicated scan software like Abbyy should do the trick.
Thanks. So far it seems like it goes:
1). Take camera images of book pages, dump onto iPhoto (Mac) or My Pictures (windows)
2). Import pictures to Scan Tailor to be cleaned up
3). Import cleaned up images to an OCR program (?)

If I got Abbyy, it cleans up images as well as OCRs, plus converts to PDF. So I wouldn't necessarily need Scan Tailor. Or do you really recommend Scan Tailor over Abbyy for the pre-OCR step?
recaptcha
Posts: 64
Joined: 03 Sep 2010, 13:23
Number of books owned: 0
Location: Calgary, Alberta, Canada

Re: Basic Guide to Workflow ?

Post by recaptcha »

daniel_reetz wrote:
strider1551 wrote: ...and a little closer to the original topic, I would encourage you, recaptcha, that there is no need to build a book scanner before trying out the software side firsthand. Take a quick picture of a book held open with your hand or a piece of paper or something; forget about getting the angle right and eliminating page curvature - that's half of what the book scanner is for. Toss the image on your computer, see how Scan Tailor works (and the things you can't expect it to fix!), try different formats like pdf, djvu, etc.
Thanks strider, sounds like a good idea. Does Scan Tailor also do a good job on books with a lot of graphs, charts and photos? Would my camera settings be any different for those types of books?
Tim

Re: Basic Guide to Workflow ?

Post by Tim »

If I got Abbyy, it cleans up images as well as OCRs, plus converts to PDF. So I wouldn't necessarily need Scan Tailor. Or do you really recommend Scan Tailor over Abbyy for the pre-OCR step?
Depends on whether ABBYY or OmniPage is working well enough for your needs. If it is, just use that. What I find is that OmniPage, which I use most, doesn't crop pages as well as Scan Tailor does, and there are a significant number of errors from artifacts near the borders that OmniPage tries to recognize as text. Since my goal is perfectly edited text, I end up wasting a lot of time cleaning up things like that, so I like the cleaned up images from Scan Tailor. Scan Tailor also removes the background color better than the commercial OCR packages do, and cleans up the text enough to make the OCR process a little smoother. I would say, definitely just try the OCR package first if that is your main goal. If it's good enough and fast enough, that may be all you need. But then try Scan Tailor and see how that compares.
Does Scan Tailor also do a good job on books with a lot of graphs, charts and photos?
It does a remarkably good job with them. In the mixed mode in the Output stage it can treat the graphs, charts, and images as color images, and the rest as black and white text (I think. It could be greyscale, I didn't test it.)
Would my camera settings be any different for those type of books?
Maybe you would want to be even more careful about your white balance settings so your colors come out right, but other than that no changes. If you use incandescent lights, the tungsten setting on your camera does a pretty good job overall.
umpausewhat
Posts: 22
Joined: 04 Mar 2014, 00:55

Re: Basic Guide to Workflow ?

Post by umpausewhat »

Hi all,

This is my first post and I'm sure I haven't seen all the relevant discussions yet, but I wanted to chime in to describe the setup I've got, elicit any feedback, and maybe save someone some time working out the problems I've been encountering with my limited tool set. Workflow wise, I want to be able to scan and process a book with about an hour's worth of attention--I'm in the humanities, so a lot of my books are in the 300 page range and are primarily simple text--not much math, which seems to create issues for some of you. I'm also not a computer expert--don't know anything about coding and don't have any working experience with Linux. I thought I'd describe my setup in part so others at a similar level of ignorance might see how this can work with relatively common programs, namely, Scan Tailor + Adobe Acrobat Standard.

I've only scanned a few books so far and spent a lot of time trying to work out the kinks, but at this point I think something like the following gives a good balance of speed and quality (I'll mention some of the problems I encountered and the fix that seemed to work):

1) Capture pictures (I've got a bkrpr setup with two canon 590 IS cameras, which I purchased because of recommendations on this site)
2) Download pics into dedicated folders for even and odd pages (because of what happens later in the process, it's important to take pictures of blank pages so that the even and odd pages can later be quickly collated and kept in the proper page order)
3) Open Scan Tailor projects for the odd and the even pages (I've been experimenting with different dpi inputs and outputs--600 is ok sometimes, but I've also gotten some bad looking text with that setting and upped the dpi to 1200)
4) Run the Scan Tailor--as others have noted, it sometimes leaves out page numbers in the select content procedure, so you have to do a lot of manual adjustments sometimes.
5) Select all the output pages for odd, right click (I've got Windows) and select "combine files in Adobe Acrobat" (I have Acrobat Standard, which came with the ScanSnap scanner I picked up a few months ago). Repeat for even page outputs.
6) After saving separate pdf files for the odd and even pages, I run a script I found on this discussion board (http://forums.adobe.com/thread/54831) to collate the two files into one pdf. You basically open the file of odd pages, click "tools," then "collate" and it will prompt you to select the file you want to collate with the current one.
7) Print file to new pdf. I found I have to do this because of how small the Scan Tailor output image is; when I run the Adobe OCR, it doesn't seem to like such small text, even if it's relatively clear (Scan Tailor output pages can be just a few square inches).
8) After saving new pdf, run the ocr--I use the Clearscan setting. This makes the file smaller and much easier to read, even though it makes the original image inaccessible (you can't re-ocr the doc after running Clearscan). I think Clearscan basically creates custom fonts with vector algorithms, but I actually have little idea what that means. If you want to keep the original images, save the clearscan ocr doc as a separate file.
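
For anyone who would rather collate the odd and even page images before they ever reach Acrobat, the interleaving in steps 2 and 6 can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name collate and the file names are hypothetical, odd pages are assumed to come first, and it is not the Acrobat script linked in step 6.

```python
from itertools import zip_longest


def collate(odd_pages, even_pages):
    """Interleave two page lists as odd[0], even[0], odd[1], even[1], ...

    The odd list may have at most one page more than the even list.
    Any other mismatch usually means a blank page was not photographed.
    """
    if len(odd_pages) - len(even_pages) not in (0, 1):
        raise ValueError("odd/even page counts do not line up; a shot may be missing")
    merged = []
    for odd, even in zip_longest(odd_pages, even_pages):
        merged.append(odd)
        if even is not None:
            merged.append(even)
    return merged


# Hypothetical file names for a six-page test book.
odd = ["odd_001.tif", "odd_002.tif", "odd_003.tif"]
even = ["even_001.tif", "even_002.tif", "even_003.tif"]
print(collate(odd, even))
```

The length check is the reason for photographing blank pages in step 2: if one camera skips a blank, every later page pairs up with the wrong partner.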

This procedure won't result in files as small as those I've seen discussed here (djvu), but it employs familiar applications (well, Adobe is familiar anyway and Scan Tailor is user friendly enough for someone like me to figure it out) and gives decent results. Sorry if this all seems basic and has been discussed elsewhere. During the last few weeks of getting this up and running, I was always looking for some one-stop post that would tell me why my results were bad and what I could do with the tools I had to fix things. I hope this might save someone else working with similar tools a few hours of troubleshooting. But if any of you see huge gaffes here, I welcome any responses, even if it's to tell me I should have read all the other discussions first.

Thanks for existing--I've been checking in here for a while now trying to figure out how to get my own setup going and your discussions are very helpful.
Tulon
Posts: 687
Joined: 03 Oct 2009, 06:13
Number of books owned: 0
Location: London, UK
Contact:

Re: Basic Guide to Workflow ?

Post by Tulon »

umpausewhat wrote:Scan Tailor output pages can be just a few square inches
This indicates the input DPI was way off. See my signature.
Scan Tailor experimental doesn't output 96 DPI images. It's just what your software shows when DPI information is missing. Usually what you get is input DPI times the resolution enhancement factor.
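
A quick bit of arithmetic shows why a wrong input DPI shrinks the page: the physical size is just the pixel count divided by the DPI. The sketch below assumes a hypothetical 3000 x 4000 pixel camera shot; the function names are made up for illustration and are not part of Scan Tailor.

```python
def printed_size_inches(width_px, height_px, dpi):
    """Physical size implied by a pixel size and a DPI value."""
    return width_px / dpi, height_px / dpi


def estimate_dpi(pixels, inches):
    """DPI implied by how many pixels span a known physical length."""
    return pixels / inches


# Same hypothetical 3000 x 4000 pixel image, two different DPI claims.
for dpi in (300, 1200):
    w, h = printed_size_inches(3000, 4000, dpi)
    print(f"{dpi} DPI -> {w:.1f} x {h:.1f} in")
# 300 DPI -> 10.0 x 13.3 in
# 1200 DPI -> 2.5 x 3.3 in

# A page 6 inches wide that covers 1800 pixels was shot at about 300 DPI.
print(estimate_dpi(1800, 6))
```

Declaring 1200 DPI for a camera shot does not add detail; it only tells the software that the same pixels cover a much smaller sheet of paper.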
umpausewhat
Posts: 22
Joined: 04 Mar 2014, 00:55

Re: Basic Guide to Workflow ?

Post by umpausewhat »

Tulon wrote:
umpausewhat wrote:Scan Tailor output pages can be just a few square inches
This indicates the input DPI was way off. See my signature.
Thanks Tulon. Noob mistake. I didn't really understand DPI--thought it just meant resolution and was experimenting with higher dpis to see if I could get sharper images. Correcting the mistake makes a huge difference.
JJJM
Posts: 26
Joined: 13 May 2010, 01:24

Re: Basic Guide to Workflow ?

Post by JJJM »

umpausewhat wrote:
8) After saving new pdf, run the ocr--I use the Clearscan setting. This makes the file smaller and much easier to read, even though it makes the original image inaccessible (you can't re-ocr the doc after running Clearscan). I think Clearscan basically creates custom fonts with vector algorithms, but I actually have little idea what that means. If you want to keep the original images, save the clearscan ocr doc as a separate file.
I didn't know about this option in Acrobat. I am playing with it and it seems to give very good results, at least for my purpose, which is to get paper books onto my ereader. It's not the most perfect ocr solution, but it gives you a readable enough document within a short time, avoiding all the postprocessing time you spend when you use FineReader. I think Clearscan deserves a look, and it surprises me that this is the first time someone has talked about it on this board.

Thanks a lot, and I would welcome more comments on Clearscan.
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: Basic Guide to Workflow ?

Post by spamsickle »

I didn't know about the Clearscan option in Acrobat either, but that's not surprising: the last version of Acrobat I bought was 5.0, and this option was apparently introduced in version 9. Finally, maybe I have a reason to upgrade...

I'm curious -- a couple of people in this thread have mentioned that they think pre-processing with Scan Tailor gives them better OCR results. I haven't tended to do OCR on my scans, so I don't have any results of my own, but I'd think if one was using the Clearscan option, one would want the original images rather than the Scan Tailor binarizations, as I would expect those to produce better vector fonts. Is it possible to pipe the post-Scan Tailor text into a Clearscan Acrobat step, and get the best of both worlds (post Scan Tailor OCR, plus pre Scan Tailor vector fonts)?