From tiff-scans, ScanTailor and Tesseract to djvu-files - how?

Alexander
Posts: 4
Joined: 04 Jan 2020, 05:38
Number of books owned: 0
Country: Germany

Post by Alexander »

Hi,

I scanned books with xsane in TIFF format (300 dpi).
I cleaned up those TIFF files with ScanTailor. With ScanTailor's "Splitting" feature, three TIFF files are generated for each page. One with ...
  • ... all the content on it (background and foreground)
  • ... only the foreground on it, which seems to be everything which hasn't been marked as part of the Picture Zone. So, here we should see only text.
  • ... only the background on it, which seems to be especially the content of the Picture Zones.
That seems a pretty good basis for generating a size-optimized DjVu file, but how?

I would like to run the OCR program Tesseract on the foreground files. The Tesseract results should then be merged with the corresponding background file, and finally all files should be bundled into one DjVu file.

Now I am stuck at the Tesseract step. Tesseract offers different output formats (alto, hocr, pdf, tsv, txt, ...). Which one is the right one, so that I can later merge it with the background file?
Alexander

Re: From tiff-scans, ScanTailor and Tesseract to djvu-files - how?

Post by Alexander »

Meanwhile, I was able to find a working "tool-chain", all based on Linux open source software.
Perhaps somebody has an idea how to optimize it further:

Step 1: the scan
I am scanning with my flatbed scanner using SANE.
I have often read that the resolution should be at least 300 dpi and that an image format should be chosen which has neither an alpha channel nor compression. I have no clue whether TIFF fulfils those two requirements, but TIFF is the format I chose.

Step 2: Picture correction with ScanTailor Advanced
What a great piece of software. Besides selecting the content, defining picture zones etc., I found the "Output" settings quite interesting. Since I want to end up with DjVu files that have a foreground and a background for each page, my output settings are:
- Split output active
- B&W foreground active (no clue what this is good for ...)
- Processing->Black on white mode active (no clue what this is good for ...)

Step 3: Convert each page to a DjVu file
Step 2 generates three files from each page: (i) one TIFF file which should be the full page, (ii) one TIFF file with just the foreground, which should be the pure text, and (iii) one TIFF file with the background, mainly the images from the picture zones.

Now some programs from the DjVuLibre software bundle are needed.

Step 3.1: Creating RLE files for the foreground
Each foreground TIFF file needs to be converted to an RLE file (since later on we will use csepdjvu, which requires the RLE format for the foreground).
I have not found a way to generate RLE files directly from TIFFs, so I have to take a little detour:
$ cjb2 <Input>.tif <Output>.djvu
$ ddjvu -format=rle -v <Output from cjb2>.djvu <Output>.rle
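
For a whole book, the two commands above can be wrapped in a loop. The following is only a sketch: the function name fg_to_rle and the ".fg.djvu" suffix for the intermediate file are made up for illustration, and it assumes that cjb2 and ddjvu from DjVuLibre are on the PATH and that all foreground TIFFs sit in one directory.

```shell
# Convert every foreground TIFF in a directory to RLE via cjb2 + ddjvu.
# Assumption: the directory passed in holds ScanTailor's foreground output.
fg_to_rle() {
    dir=$1
    for tif in "$dir"/*.tif; do
        [ -e "$tif" ] || continue          # skip if the glob matched nothing
        base=${tif%.tif}
        cjb2 "$tif" "$base.fg.djvu" &&
        ddjvu -format=rle "$base.fg.djvu" "$base.rle" &&
        rm -f "$base.fg.djvu"              # intermediate DjVu is no longer needed
    done
}
```

It would be called with the foreground directory as its only argument, for example fg_to_rle out/foreground (the path is a placeholder).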

Step 3.2: Creating PPM files for the background
Each background TIFF file needs to be converted to a PPM file (since later on we will use csepdjvu, which requires the PPM format for the background).
$ convert <filename>.tif <filename>.ppm

=> So now we have, for each single page, one RLE file which represents the foreground and one PPM file which is the background of that page.

Step 4: Merging the fore- and background into one single file
$ cat <foreground>.rle <background>.ppm > <one_complete_page>.mix

=> So, now we have for each single page of the book exactly one .mix-file.
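
The background conversion and the merge can be done page by page in one pass. This is a sketch under assumptions: the function name make_mix and the three directory arguments are hypothetical, foreground and background files are assumed to share the same base name, and convert is ImageMagick's tool (ImageMagick 7 calls it magick).

```shell
# Build one .mix file per page from <fg_dir>/<name>.rle and <bg_dir>/<name>.tif.
# Assumption: both directories use identical base names for the same page.
make_mix() {
    fg_dir=$1; bg_dir=$2; out_dir=$3
    mkdir -p "$out_dir"
    for rle in "$fg_dir"/*.rle; do
        [ -e "$rle" ] || continue          # skip if the glob matched nothing
        name=$(basename "$rle" .rle)
        bg="$bg_dir/$name.tif"
        if [ -e "$bg" ]; then
            # Background TIFF -> PPM, then foreground first, background second.
            convert "$bg" "$out_dir/$name.ppm" &&
            cat "$rle" "$out_dir/$name.ppm" > "$out_dir/$name.mix"
        else
            echo "no background for $name, skipping" >&2
        fi
    done
}
```

Pages without a matching background file are only reported, not merged, so a missing file shows up before the bundling step.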

Step 5: Bundling all the pages together into one single djvu-file
$ csepdjvu -vv <page_1>.mix <page_2>.mix <page_n>.mix out.djvu
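
With many pages, typing every .mix file by hand gets tedious, so a glob can supply the list. A minimal sketch, assuming all .mix files are in one directory; the function name bundle_pages is hypothetical.

```shell
# Bundle every .mix file of a directory into one DjVu document.
bundle_pages() {
    mix_dir=$1; out=$2
    # The glob expands in lexicographic order, so page_10 would sort before
    # page_2; zero-padded names (page_002, page_010) keep the pages in order.
    set -- "$mix_dir"/*.mix
    [ -e "$1" ] || { echo "no .mix files in $mix_dir" >&2; return 1; }
    csepdjvu -vv "$@" "$out"
}
```

The page order in the resulting file follows the file-name order, so it is worth checking the names before bundling.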

Step 6: OCR
For the final text recognition I want to use Tesseract, which unfortunately cannot handle DjVu files. But there is the wrapper ocrodjvu, which solves that issue.
$ ocrodjvu -o <final_ebook>.djvu -l "<language_code>" <output_file_from_Step5>.djvu


By far the most time-consuming part is the OCR. I am wondering whether the -j option of ocrodjvu (number of OCR threads) would speed this up. Is there a relation between the threads and the CPU cores? What number of threads would be meaningful (I have an AMD CPU with 6 cores and an NVIDIA GPU)?
zbgns
Posts: 61
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Re: From tiff-scans, ScanTailor and Tesseract to djvu-files - how?

Post by zbgns »

Alexander wrote: 15 Feb 2020, 09:09
By far the most time-consuming part is the OCR. I am wondering whether the -j option of ocrodjvu (number of OCR threads) would speed this up. Is there a relation between the threads and the CPU cores? What number of threads would be meaningful (I have an AMD CPU with 6 cores and an NVIDIA GPU)?
I guess that you use Tesseract 4.x. It has multithreading built in, and the ocrodjvu -j option seems to be rather for limiting the number of threads involved in the OCR work (there apparently are some issues, see the ocrodjvu bug tracker). In general, Tesseract is not a very fast OCR tool, and it is difficult to speed it up significantly.
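
One combination that may be worth benchmarking is to cap Tesseract's own OpenMP threading with the OMP_THREAD_LIMIT environment variable and let ocrodjvu run several pages at once with -j. This is a sketch, not a measured optimum: the function name ocr_parallel is hypothetical, it assumes nproc is available, and one job per core is only a starting point.

```shell
# OCR a bundled DjVu with one single-threaded Tesseract process per CPU core.
ocr_parallel() {
    src=$1; out=$2; lang=$3
    njobs=$(nproc 2>/dev/null || echo 1)   # fall back to 1 job if nproc is missing
    # OMP_THREAD_LIMIT=1 keeps each Tesseract process on one thread so the
    # parallel jobs do not compete for the same cores.
    OMP_THREAD_LIMIT=1 ocrodjvu -j "$njobs" -l "$lang" -o "$out" "$src"
}
```

Timing a run on a few pages with and without the thread limit should show whether it helps on a given machine.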

Resolution of pictures is another story. The standard for OCR is 300 dpi, and Tesseract follows that. So higher resolutions (e.g. 600 dpi) usually do not lead to better OCR results: the OCR process takes longer, and the quality of recognition may even be significantly worse. You did not mention the DPI of your OCRed documents, but maybe the problem has something to do with it.
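
To check what resolution the scans actually carry, the metadata of the TIFF files can be inspected. The sketch below assumes ImageMagick's identify is installed and that the files store their resolution in pixels per inch; the function name check_dpi is hypothetical, and the exact text identify prints for %x can differ between ImageMagick versions, which is why only the leading integer is compared.

```shell
# Warn about scans whose horizontal resolution is not the expected value.
# Assumption: the TIFFs carry resolution metadata in pixels per inch.
check_dpi() {
    want=$1; shift
    for f in "$@"; do
        res=$(identify -format '%x' "$f") || continue
        res=${res%%[ .]*}                  # keep the integer part, drop units or decimals
        [ "$res" = "$want" ] || echo "$f: $res dpi (expected $want)"
    done
}
```

For example, check_dpi 300 *.tif would list only the files that deviate from 300 dpi.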