Initial File Format

Just what it says.

Moderator: peterZ

pajaro
Posts: 1
Joined: 04 Mar 2014, 00:52

Initial File Format

Post by pajaro »

My question regards the initial file format. For a long time I scanned either directly to PDF or to JPG, but the final files were huge. Two months ago I accidentally scanned a whole book in BMP format and, although the images were initially larger than the JPGs, the final PDF was much, much smaller. Since then I scan only to BMP and convert the files to PDF afterwards.

Since the starting format affects the whole post-production workflow, I would like to know your opinion about the best source file format for projects. My questions are as follows:

1) Should we scan to JPG, BMP, TIFF, etc.?

2) Are any of the above formats better for OCR?

3) Which file format creates smaller PDFs?

Your experience is welcome.

Sorry if I've not posted where I should.

Best regards,

pajaro
rob
Posts: 773
Joined: 03 Jun 2009, 13:50
E-book readers owned: iRex iLiad, Kindle 2
Number of books owned: 4000
Country: United States
Location: Maryland, United States

Re: Initial File Format

Post by rob »

If you're getting your images from a camera, of course you're going to end up with JPEG. From a scanner, grayscale (8bpp) JPEG at extremely high quality is probably OK.

I think for OCR, any format is good, as long as it's high-resolution and high-quality. Most likely the output from Scan Tailor (1bpp, 600 dpi) is suitable.

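PDF documents typically embed images either with JPEG compression or with the fax-style (CCITT Group 4) compression that TIFF files also use. So if your BMPs were 1 bit per pixel (BMP supports other bit depths too), it's possible that your PDF creator stored them with that bilevel compression, which can get really small for 1 bit per pixel.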
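
One way to check that guess is to read the bits-per-pixel field straight out of the BMP header. What follows is only a minimal sketch in Python, assuming the common BITMAPINFOHEADER layout (a DIB header of 40 bytes or more); the names `bmp_info` and `raw_size_bytes` are made up for illustration and do not belong to any scanning tool.

```python
import struct


def bmp_info(header: bytes):
    """Return (width, height, bits_per_pixel) from the first 30 bytes of a BMP.

    Assumes a BITMAPINFOHEADER-style DIB header (size >= 40 bytes).
    """
    if header[:2] != b"BM":
        raise ValueError("not a BMP file")
    # The DIB header starts right after the 14-byte file header.
    (dib_size,) = struct.unpack_from("<I", header, 14)
    if dib_size < 40:
        raise ValueError("unsupported (older) BMP header variant")
    # width, height (signed), planes, bits per pixel
    width, height, _planes, bpp = struct.unpack_from("<iiHH", header, 18)
    # A negative height only means the rows are stored top-down.
    return width, abs(height), bpp


def raw_size_bytes(width: int, height: int, bpp: int) -> int:
    """Uncompressed pixel data size; BMP pads each row to a multiple of 4 bytes."""
    row_bytes = ((width * bpp + 31) // 32) * 4
    return row_bytes * height
```

Reading the first 30 bytes of one scan (for example with `open(path, "rb").read(30)`, where `path` is whatever file you pick) and passing them to `bmp_info` would show whether the scanner really wrote 1 bpp images. Comparing `raw_size_bytes(w, h, 1)` with `raw_size_bytes(w, h, 24)` gives a feel for how much less data a bilevel page carries before any compression is applied.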
The Singularity is Near. ~ http://halfbakedmaker.org ~ Follow me as I build the world's first all-mechanical steam-powered computer.
StevePoling
Posts: 290
Joined: 20 Jun 2009, 12:19
E-book readers owned: SONY PRS-505, Kindle DX
Number of books owned: 9999
Location: Grand Rapids, MI

Re: Initial File Format

Post by StevePoling »

rob wrote:I think for OCR, any format is good, as long as it's high-resolution and high-quality. Most likely the output from Scan Tailor (1bpp, 600 dpi) is suitable.
I don't know anything. But I'd think that JPEG compression artifacts could get in the way of OCR. I figure the edges of every letter's glyph are high-frequency features that JPEG's discrete cosine transform can't represent in a finite bandwidth (hence the halos you see in highly compressed imagery). Thus I'd expect an uncompressed or wavelet-compressed format to work better. But then I know nothing about how OCR does feature extraction.
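
The halo argument can be illustrated with a toy example. This is a simplified 1-D sketch in Python, not real JPEG: actual JPEG works on 8x8 blocks in two dimensions and quantizes coefficients rather than dropping them outright. Here the high-frequency coefficients are simply zeroed, which is roughly what very coarse quantization does to them. The helper names (`dct`, `idct`, `keep_low`) are made up for illustration.

```python
import math

N = 8  # JPEG uses 8-sample blocks along each axis


def _scale(k: int) -> float:
    # Orthonormal DCT-II scaling factors.
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)


def dct(x):
    """Forward 8-point DCT-II of a row of pixel values."""
    return [
        _scale(k) * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
        for k in range(N)
    ]


def idct(coeffs):
    """Inverse of dct() above (a DCT-III)."""
    return [
        sum(_scale(k) * coeffs[k] * math.cos(math.pi * (n + 0.5) * k / N) for k in range(N))
        for n in range(N)
    ]


def keep_low(coeffs, count):
    """Zero every coefficient except the `count` lowest frequencies."""
    return [c if k < count else 0.0 for k, c in enumerate(coeffs)]


flat = [128] * N                            # blank paper
edge = [0, 0, 0, 0, 255, 255, 255, 255]     # sharp edge of a glyph

for name, row in (("flat", flat), ("edge", edge)):
    rebuilt = idct(keep_low(dct(row), 3))
    worst = max(abs(a - b) for a, b in zip(row, rebuilt))
    print(name, [round(v) for v in rebuilt], "max error:", round(worst, 1))
```

The flat row survives the band-limiting unchanged, while the edge row comes back smeared, with errors spread over neighbouring pixels. Whether that actually hurts a given OCR engine is a separate question that only a test on real scans can answer.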
fbonomi
Posts: 59
Joined: 04 Mar 2014, 00:52

Re: Initial File Format

Post by fbonomi »

rob wrote:If you're getting your images from a camera, of course you're going to end up with JPEG.
Actually, that is not necessarily true. Our crappy little Canon cameras, thanks to CHDK, can save in RAW format, which is a lossless format straight from the sensor and so in many respects comparable to a BMP file (i.e. no JPEG artifacts).

I did some tests on this here:
http://www.diybookscanner.org/forum/vie ... t=raw#p754

The (somewhat surprising) result is that, at least in my setup, saving in RAW does not give any significant advantage.

Maybe OCR engines are already optimized to handle a certain amount of JPEG artifacts?
Last edited by Anonymous on 14 Oct 2009, 04:48, edited 1 time in total.
fbonomi
Posts: 59
Joined: 04 Mar 2014, 00:52

Re: Initial File Format

Post by fbonomi »

Now that I think about it....

It might seem obvious, but I am pretty sure somebody will have overlooked this, and I didn't find it mentioned anywhere:

Set compression to "Superfine" !!!

This setting tells the camera how much the JPGs should be compressed. You find this setting in the menu that opens with the "FUNC / SET" button, and these are its possible values:
[Image: compression.png, screenshot of the camera's compression menu]
I haven't done any tests, but my strong guess is that "Superfine" mode (the one with the little S) gives better OCR results.