Make a djvu file and add ocr: DjVuToy; TiffDjvuOcr; CuneiDjVu

Posts: 46
Joined: 30 Nov 2012, 21:37
Number of books owned: 0
Country: UK

Re: Make a djvu file and add ocr [metadata injection]

Post by b0bcat » 25 Jun 2020, 13:46

At the weekend I explored two aspects of DjVu files:

1. injecting metadata into the file;
2. using Gimp v.2.8.18, DjVu Small and DjVu Imager to make a mixed text/image DjVu page from a single .tif image output by ScanTailor Advanced v.1.0.14.

I set out here the outline notes of my initial fumbling exploratory steps as they may serve to provoke replies giving better-informed approaches to the issues and/or serve as a starting point for someone else who's about to explore the same issues.

This post is concerned with metadata injection; the succeeding post, with mixed text/image DjVu file page production.

http://www.djvu.hu/content/img/djvu_met ... aft_10.txt
"Metadata can be set and retrieved by djvused (part of DjVuLibre). It may also be read under "Metadata..." menu in DjView. Currently, WinDjVu does not support [n]either reading [n]or writing any kind of metadata.

(metadata ... (key value) ... )
Define meta-data entries. Each entry is identified by a symbol key representing the nature of the meta data entry. The string value represents the value associated with the corresponding key. Two sets of keys are noteworthy: keys borrowed from the BibTex bibliography system, and keys borrowed from the PDF DocInfo metadata.
BibTex keys are always expressed in lowercase, such as year, booktitle, editor, author, etc..
DocInfo keys start with an uppercase letter, such as Title, Author, Subject, Creator, Produced, Trapped, Creation Date, and ModDate. The values associated with the last two keys should be dates expressed according to RFC 3339."
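For the two date keys, the quoted text asks for an RFC 3339 date. A minimal Python sketch of producing such a value follows; the function name rfc3339_stamp is only an illustration and not part of DjVuLibre, and treating naive datetimes as UTC is my assumption.

```python
from datetime import datetime, timezone


def rfc3339_stamp(moment=None):
    """Return an RFC 3339 timestamp, e.g. for the ModDate key."""
    if moment is None:
        moment = datetime.now(timezone.utc)
    if moment.tzinfo is None:
        # Assumption: a datetime without a timezone is taken to be UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    # isoformat() yields date, 'T', time and a numeric UTC offset.
    return moment.isoformat(timespec="seconds")
```

The resulting string can then be used as the value of a date field.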

I was unable to find a page listing the PDF DocInfo fields; offhand, the nearest relevant one is:
https://helpx.adobe.com/acrobat/using/p ... adata.html

I used the command-line program djvused.exe from DjViewLibre v.4.10.4 (on MS Windows 7).
No environment settings on the PC were changed. I copied the DjVu file to the folder where djvused.exe resides with the rest of the DjViewLibre suite, highlighted djvused.exe in Explorer, and applied send-to: cmd prompt to open a cmd window and begin inputting the data.

Write e.g.:
C:\Program Files (x86)\DjVuLibre>djvused.exe source.djvu
author BLOGGS, Joe
title My Book of Words

Or, with the values in double quotes:
C:\Program Files (x86)\DjVuLibre>djvused.exe source.djvu
author "BLOGGS, Joe"
title "My Book of Words"

It is possible to mix BibTex and PDF DocInfo tags in the same input session, e.g.:

author BLOGGS, Joe
title My Book of Words
note Converted from {another ridiculous Internet Archive upload}
publisher Megacorp-Monopolies, Inc.
doi {an ISBN number?}
Keywords think, of, a, word, that, rhymes
Subject The pitiful ramblings of an illiterate

NB: one must input all fields in a single session and then save; trying to add fields later deletes the ones that already exist.
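One way around the overwrite problem is to read the existing fields first, merge in the new ones, and write the whole set back. The Python sketch below is untested against a real file and rests on assumptions: that djvused is on the PATH, that its print-meta command prints one key followed by a double-quoted value per line, and that set-meta accepts a file in the same layout. The function names and the meta.txt file name are mine, for illustration only.

```python
import re
import subprocess


def format_meta(entries):
    """Render a dict as one 'key<TAB>"value"' line per entry."""
    lines = []
    for key, value in entries.items():
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{key}\t"{escaped}"')
    return "\n".join(lines) + "\n"


def parse_meta(text):
    """Parse 'key "value"' lines back into a dict (simplified unescaping)."""
    entries = {}
    for line in text.splitlines():
        match = re.match(r'^(\S+)\s+"(.*)"\s*$', line)
        if match:
            # Only \" and \\ are undone here; other escape sequences
            # (e.g. for non-ASCII text) are not handled in this sketch.
            entries[match.group(1)] = re.sub(r"\\(.)", r"\1", match.group(2))
    return entries


def add_fields(djvu_path, new_entries, meta_path="meta.txt", djvused="djvused"):
    """Merge new fields into the existing metadata and write everything back."""
    current = subprocess.run(
        [djvused, "-e", "print-meta", djvu_path],
        capture_output=True, text=True, check=True,
    ).stdout
    merged = {**parse_meta(current), **new_entries}
    with open(meta_path, "w", encoding="utf-8") as handle:
        handle.write(format_meta(merged))
    # Assumption: meta_path contains no spaces; -s saves the file afterwards.
    subprocess.run(
        [djvused, "-e", f"set-meta {meta_path}", "-s", djvu_path],
        check=True,
    )
    return merged
```

Calling add_fields("source.djvu", {"Subject": "The pitiful ramblings of an illiterate"}) would then keep any author and title already stored in the file, provided the assumptions above hold.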

NB2: The content of fields input in BibTex metadata format does not display in the www.cuminas.jp DjVu Shell Extension (see below), although that does display the content of PDF DocInfo format fields. Both types can be displayed using djvused.exe or View/Metadata in the DjView.exe viewer.

ExifTool v12.01 displays the content of the data fields at the end of its ALL list, under its heading:

"DjVu Shell Extension Pack (FREE)
DjVu Shell Extension Pack is an extension package for Windows, which enables you to take advantages of DjVu’s various features.

You can see DjVu thumbnails on Windows Explorer.
You can search DjVu files using Windows Search.
You can see DjVu preview on Windows Explorer and Microsoft Outlook.
You can see/edit DjVu metadata on Windows Explorer.
You can see DjVu files using Windows Photo Gallery, Windows Live Photo Gallery and any .NET Framework 3.0/Windows Imaging Codec based applications.

The package contains IFilter, WIC codec and Property Store."

Re: Make a djvu file and add ocr [mixed text/image tif encoding to DjVu file]

Post by b0bcat » 25 Jun 2020, 14:28

Using Gimp v.2.8.18, DjVu Small and DjVu Imager to make a mixed text/image DjVu page from a single .tif image output by ScanTailor Advanced v.1.0.14.
These are my preliminary notes on converting one or two isolated mixed text/image tif files to DjVu.

The approach simulates the "ScanTailor Featured" export tif split by creating foreground and background tif files for conversion into DjVu and merging them using DjVu Imager. For one or two isolated mixed text/image pages, this is perhaps easier than using e.g. ScanTailor Advanced's output splitting for the tif files followed by processing by DjVu Imager etc.

1. Save the foreground 'text' part of the source tif file as a new tif:
Open the mixed image/text tif file in GIMP and select the image part. Cut that out and export the remainder as a new file with the same name to the folder ending \export\1, retaining the format of the tif file, i.e. greyscale or colour.

2. Save the background 'image' part of the source tif file as a new tif:
In GIMP, undo the cut, invert the selection, then press cut so that this time the text is removed; export to the folder ending \export\2 as a new file with the same name, retaining the format of the tif file, i.e. greyscale or colour.

3. In GIMP, close the mixed text/image source tif file without saving it, so the original tif is preserved unchanged.

4. Using DjVu Small, convert to DjVu the foreground 'text' tif source file in the folder ending \export\1.
Place (or output) the new DjVu file in the folder ending \export\1.
Then change the name of that new 'text'/foreground DjVu file to
DjVu Encoded1.djvu
leaving it in its selected folder ending \export\1.

5. Using DjVu Imager, convert the background 'image' tif source file in folder \export\2 to a DjVu file:
In DjVu Imager, open the 'image'/background tif file from the selected folder \export\2 and press tab 'Convert'. The output 'image'/background DjVu is saved by DjVu Imager in its ..\tmp\images folder.

6. Combine the background and foreground DjVu files into a new DjVu file:
Check that the paths in DjVu Imager point to the ..\export\1 and ..\export\2 locations, then press tab 'Insert in DjVu'. The text and image DjVu files are merged into a new DjVu file located in the selected folder ending ..\export\2.
N.B.: in the file name input at the very top left of the DjVu Imager window, to the right of the filename, change # to 1, at least when operating on a single file pair; otherwise the names will not match and the paste will fail. It seems DjVu Imager converts the image file to DjVu in its temp directory at ..\djvu_imager_v2_9\tmp\images

7. The merged DjVu file output by DjVu Imager is "DjVu Encoded1.out.djvu", which DjVu Imager places in the ..\export\2 location. Rename the merged DjVu file to the corresponding original mixed tif filename but with the DjVu suffix (and, if encoding with e.g. DjVuSolo rather than DjVu Small, add _0001 at the end, e.g. Test_001_0001.djvu).
Then insert it into a new or existing DjVu file using the Editor function of DjVuToy v.2.10.

Example paths for the tif/DjVu files to be combined:
c:\Scan\tiff\!DjVuImager\export\1\DjVu Encoded1.djvu [Source]
c:\Scan\tiff\!DjVuImager\export\2\DjVu Encoded1.out.djvu [Dest.]
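The renaming in step 7 could be scripted along these lines. This is only a sketch of the naming rule as described above: the function names are made up for illustration, and the merged file name is taken from what DjVu Imager produced in my test.

```python
from pathlib import Path

# Name of the merged file as reported in step 7.
MERGED_NAME = "DjVu Encoded1.out.djvu"


def final_name(tif_name, solo_suffix=False):
    """Map the original tif filename to the name for the merged DjVu page."""
    stem = Path(tif_name).stem
    if solo_suffix:
        # Step 7: add _0001 when the text layer was not encoded by DjVu Small.
        stem += "_0001"
    return stem + ".djvu"


def collect_merged(export_dir, tif_name, solo_suffix=False):
    """Rename the merged file inside <export_dir>\\2 and return its new path."""
    source = Path(export_dir) / "2" / MERGED_NAME
    target = source.with_name(final_name(tif_name, solo_suffix))
    source.rename(target)
    return target
```

For example, collect_merged(r"c:\Scan\tiff\!DjVuImager\export", "Test_001.tif") would leave Test_001.djvu in the export\2 folder, ready to be inserted with DjVuToy.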

Posts: 5
Joined: 02 Jun 2020, 13:29
Number of books owned: 0
Country: Rather

Re: Make a djvu file and add ocr: DjVuToy; TiffDjvuOcr; CuneiDjVu

Post by Noitaenola » 27 Oct 2020, 17:14

I wrote a little Python script for inserting metadata into DjVu files. It basically uses the same procedure you do, but automated. I referenced ExifTool's docs for the available fields.

As for the mixed text/image DjVus, I think it's faster to feed Scan Tailor Advanced's "Split output" images to DjVuImager than manually processing page by page with GIMP (or any other software).
