
Mac users: jbig2enc is now in Homebrew! (Make better PDFs)


Post by Misty » 19 Aug 2011, 10:07

Good news for Mac users interested in making PDFs! jbig2enc just got added to Homebrew, the best way to install open source software in Mac OS X. jbig2enc is an open source tool that allows you to make very small PDFs from pages you've processed in Scan Tailor, a fraction of the size of what you can make with other PDF tools (and about the same size as DjVu). You can use it on its own for text-only books, or in conjunction with PDFBeads for books that use a mixture of text and illustrations.

If you have Homebrew installed, you can install jbig2enc easily just by typing

Code: Select all

brew install jbig2enc
in Terminal. Then you're ready to compress TIFFs from Scan Tailor by typing 'jbig2'.
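
If you're not sure whether the install worked, a quick check along these lines may help. It's only a sketch: check_tool is a made-up helper name, not something that ships with Homebrew or jbig2enc, and all it does is ask the shell whether a command can be found on your PATH.

```shell
# Report whether a command is available on your PATH.
# check_tool is a hypothetical helper name used for this example only.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
    fi
}

check_tool brew    # Homebrew itself
check_tool jbig2   # the program that jbig2enc installs
```

If jbig2 comes back as "not found", running the brew install command again and reading its output is probably the first thing to try.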
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.

Post by rxninja » 04 Oct 2011, 10:36

I'm at this point in Terminal:

Code: Select all

Options:
  -b <basename>: output file root name when using symbol coding
  -d --duplicate-line-removal: use TPGD in generic region coder
  -p --pdf: produce PDF ready data
  -s --symbol-mode: use text region, not generic coder
  -t <threshold>: set classification threshold for symbol coder (def: 0.85)
  -T <bw threshold>: set 1 bpp threshold (def: 188)
  -r --refine: use refinement (requires -s: lossless)
  -O <outfile>: dump thresholded image as PNG
  -2: upsample 2x before thresholding
  -4: upsample 4x before thresholding
  -S: remove images from mixed input and save separately
  -j --jpeg-output: write images from mixed input as JPEG
  -v: be verbose
Pretend I've never used Terminal, Homebrew, or jbig2enc before. What do I do here?

Post by Misty » 04 Oct 2011, 11:01

If you're at the point of running jbig2enc, you should already have processed TIFF files you want to turn into a PDF. If you haven't processed any files with Scan Tailor yet, you should do that first! You can install Scan Tailor in Homebrew by doing

Code: Select all

brew install scantailor
Then run it by typing "scantailor" at the command prompt. Don't worry, it's a GUI app.

jbig2enc runs in two steps. Step one takes a page, or several pages, and produces raw JBIG2 data. That's not very useful for you, since you probably want to actually read your page, not just admire the file in your folder. For that, jbig2enc comes with a handy utility that converts the raw data into a PDF for you.

If you want to make a book that has OCR text (so you can search or copy and paste text), or a mixture of text and colour pictures, then a program called PDFBeads is the best way to do that. I'll show you how to do it with jbig2enc first, then how to use PDFBeads.

Let's assume you have a set of processed TIFFs, and they're located in Documents/mybook. Here's what you would do:

1) Open Terminal.

2) Change to the folder containing your pages. We said that's in Documents/mybook, so you want to change to that directory. You can do that with the following command:

Code: Select all

cd ~/Documents/mybook
cd stands for change directory. The tilde (squiggly) represents your home folder, so you want that to come first since Documents is inside your home folder.
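
If you're curious what the tilde actually turns into, you can ask the shell to show you. This is just an illustration; Documents/mybook is the example folder from above, and book_dir is a variable name picked for the example.

```shell
# Print the home folder that the shell substitutes for "~".
echo ~

# "~/Documents/mybook" is shorthand for "$HOME/Documents/mybook".
book_dir=~/Documents/mybook
echo "$book_dir"
```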

3) This folder should contain a whole bunch of TIFF files. Let's say they're named image001.tiff, image002.tiff, image003.tiff, etc. (if yours end in .tif instead, adjust the commands below to match). You want jbig2enc to compress all of them and produce a single symbol file, under one name, that can be used to create a single PDF. You can do that with the following command:

Code: Select all

jbig2 -b mybook -p -s image*.tiff
Let's break down what that means:

jbig2: This is just the program name.
-b mybook: This defines the "basename" for the output files. Each page is saved as its own numbered file (mybook.0000, mybook.0001, and so on), and jbig2enc also writes a single symbol file (mybook.sym) shared by all of your images, which makes it easier to combine everything into a single PDF.
-p: This makes sure the output is PDF-ready.
-s: This uses the symbol coder, which is best for making JBIG2 files from text.
image*.tiff: These are your images. The star is a wildcard, so it matches all .tiff files that start with "image".

jbig2enc will churn away for a bit, then tell you when it's done.
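
Before running jbig2 on a whole book, it can be reassuring to see exactly which files the wildcard will pick up. The sketch below does that in a throwaway folder full of empty placeholder files, so nothing real gets touched. The image names are the example names from step 3; notes.txt is a made-up extra file, there to show what gets skipped.

```shell
# Work in a temporary folder so your real book folder is left alone.
demo_dir=$(mktemp -d "${TMPDIR:-/tmp}/wildcard-demo.XXXXXX")
cd "$demo_dir"

# Create empty placeholder files (these are not real scans).
touch image001.tiff image002.tiff image003.tiff notes.txt

# The shell expands the wildcard before jbig2 ever sees it; echo shows the result.
matched=$(echo image*.tiff)
echo "$matched"

# Count the matches: "set --" stores them as $1, $2, ... and $# is how many.
set -- image*.tiff
page_count=$#
echo "$page_count pages would be passed to jbig2"
```

In your own book folder, typing echo image*.tiff on its own gives the same kind of preview.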

4) OK, now it's time to make a PDF! This is pretty easy.

Code: Select all

pdf.py mybook > mybook.pdf
"mybook" is the basename you used in the last step. The little arrow (">") tells it where to send the output, and "mybook.pdf" is the name of the PDF file to write. You can change all those names to whatever you want, of course.

pdf.py won't tell you when it's done, but once you get the command prompt back, it's finished! Take a look at the PDF file, and you should have no trouble reading it.
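
If you end up doing this for a lot of books, steps 3 and 4 could be wrapped up in a small shell function, roughly like the sketch below. The names make_book_pdf and DRY_RUN are invented for this example, and it assumes both jbig2 and pdf.py can be found on your PATH, which may not be true on every setup. With DRY_RUN=1 it only prints the commands it would run.

```shell
# Build a PDF from Scan Tailor TIFFs using the two commands from steps 3 and 4.
# make_book_pdf and DRY_RUN are hypothetical names, not part of jbig2enc.
# Usage: make_book_pdf <basename> <tiff files...>
make_book_pdf() {
    book_name=$1     # e.g. "mybook"
    shift            # everything left over is the list of TIFF files

    # With DRY_RUN=1, just show the commands instead of running them.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "jbig2 -b $book_name -p -s $*"
        echo "pdf.py $book_name > $book_name.pdf"
        return 0
    fi

    # Stop early with a readable message if either tool is missing.
    for tool in jbig2 pdf.py; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "$tool not found; is jbig2enc installed?" >&2
            return 1
        }
    done

    # Step 3: compress the pages. Step 4: wrap them in a PDF.
    jbig2 -b "$book_name" -p -s "$@" && pdf.py "$book_name" > "$book_name.pdf"
}

# Preview the commands without running anything.
DRY_RUN=1 make_book_pdf mybook image001.tiff image002.tiff
```

Once the preview looks right, leaving off DRY_RUN=1 and passing image*.tiff instead of individual names would run the real thing.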

Making a PDF with OCR, or mixed image/text content, is a bit more complex. I'll show you how to do that in the next post.

Post by rxninja » 05 Oct 2011, 14:50

This was, by far, one of the most helpful posts I've read on this entire forum. I've gone from going, "AHH! WHY ARE MY PDFS HUNDREDS OF MB?!" to having a fully processed, 285-page book that's only 1.6MB. Incredible. Bonus: I'm not so intimidated by Terminal anymore (even if I did use the DOS prompt back in the day to run games, that was a LONG time ago).

I'd love to hear what you have to say about how to do OCR. Currently, I use Acrobat Pro and just click Tools > Recognize Text and it processes the full PDF. Sometimes it even does slightly better skew correction than ScanTailor (even if ST has already done its own first).

So now my process looks like this:
1) Take pictures of every page.
2) Transfer every image into L and R folders.
3) Use a renaming program to sequentially rename every file. I use NameChanger for Mac. I just do a 1,2,3... renaming method, replace the entire file name, and append "1.jpg" to the left files and "2.jpg" to the right files.
4) Move everything to another folder (I call it "Ordered"). The files will magically be in order and read like "0011.jpg, 0012.jpg, 0021.jpg, 0022.jpg...etc." The first three digits are for ordering and the 4th digit denotes left and right. It may not be the best way to get ordering and combining done, but it's fast, it works, and I felt clever for figuring out that I could do it that way.
5) Open the contents of "Ordered" in ScanTailor and work ScanTailor's voodoo magic on them for page splitting, deskewing, content selection, margins, and conversion to black & white. This is by far the longest step.
6) Do the jbig2enc magic I just learned here to compress the ScanTailor output TIF files and combine them into a PDF.
7) Open the resulting PDF in Acrobat Pro and go to Tools > Recognize Text to do additional skew correction and OCR.
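
For anyone who'd rather do steps 3 and 4 from Terminal instead of a renaming program, something like this sketch might work. It's a guess at an equivalent, not a description of what NameChanger does: order_pages is a made-up function name, it assumes the photos end in lowercase .jpg and that the camera's file names already sort in shooting order, and it copies rather than moves so the originals stay put.

```shell
# Copy photos from L/ and R/ into Ordered/ using the NNNS.jpg naming scheme:
# NNN = sequence number (001, 002, ...), S = 1 for left pages, 2 for right pages.
# order_pages is a hypothetical helper name used for this example only.
order_pages() {
    src_root=$1                      # folder that contains L/ and R/
    mkdir -p "$src_root/Ordered"

    for side in L R; do
        # The last digit sorts each left page just ahead of its right page.
        if [ "$side" = "L" ]; then suffix=1; else suffix=2; fi

        n=0
        for photo in "$src_root/$side"/*.jpg; do
            [ -e "$photo" ] || continue      # nothing to do if the folder is empty
            n=$((n + 1))
            new_name=$(printf '%03d%s.jpg' "$n" "$suffix")
            cp "$photo" "$src_root/Ordered/$new_name"
        done
    done
}
```

You'd call it with the folder that holds L and R, for example order_pages ~/Pictures/mybook (that path is only an example). Three digits of counter covers up to 999 photos per side.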

Do I have that about right? Are there any major steps I'm missing or anything I could be doing differently/better? I haven't had good luck at all with color images (single pages jump to 30+ MB, which seems insane), but I haven't tried them with this new TIF compression method. I'd eventually like to scan some of my graphic novels so I can bring them with me on my iPad.

Again, you've been tremendously helpful. Thank you so much for the detailed, quick response.

Post by Misty » 05 Oct 2011, 15:30

Well, thanks! Glad I could help. I recognize a lot of this stuff is confusing.

Acrobat does a perfectly good job of doing OCR, as long as you find it's not pumping up your filesizes too much. The other option is to use an open-source OCR program plus PDFBeads to join it. The problem is that open source OCR just isn't as good as the commercial options, so accuracy may not be quite as nice.

I'll try to get the Mac user megathread up in the next few days - I'll include the PDFBeads instructions there.

Re: Colour images... yeah. That's a sticking point for sure. Books which are MIXED content (some text, some images) can compress nicely using a program like PDFBeads. Full colour pages? Not so much... That said, there are a couple of things you can do:

- Use a more efficient compression. TIFFs are gigantic and use inefficient compression for colour and greyscale content at the best of times. Something like lossy JPEG2000 will give you better compression than either TIFF or JPEG at comparable quality.
- Reduce the resolution of your images. You probably scanned them somewhere around 300DPI, and Scan Tailor boosted them to 600DPI. That's a good idea for text pages that are being bitonalized (turned into pure black and white), but not a good idea for pages that contain only colour or greyscale images. For reading, most of the time you can be very comfortable with images in the scale of 100 to 200DPI, and that saves you a ton of room.
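
To get a feel for why dropping the resolution saves so much room: the number of pixels goes up with the square of the DPI. Here's a bit of shell arithmetic to illustrate. The 6 x 9 inch page size is an assumption made up for the example, page_pixels is just a helper name, and real file sizes after compression won't shrink by exactly this ratio.

```shell
# Pixel count of a page image at a given resolution.
# Arguments: width in inches, height in inches, DPI.
# page_pixels is a hypothetical helper name used for this example only.
page_pixels() {
    echo $(( ($1 * $3) * ($2 * $3) ))
}

# A hypothetical 6 x 9 inch page at Scan Tailor's 600 DPI versus 150 DPI.
at_600=$(page_pixels 6 9 600)
at_150=$(page_pixels 6 9 150)

echo "600 DPI: $at_600 pixels"
echo "150 DPI: $at_150 pixels"
echo "Reduction factor: $(( at_600 / at_150 ))x"
```

So going from 600DPI to 150DPI is a quarter of the resolution but a sixteenth of the pixels.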

PDFBeads has built-in options to deal with these kinds of issues. It's actually pretty awesome. The Mac user megathread, when it exists (which is not now), will tell you how to do that!