
Scantools for Linux - convert to PDF with OCR

Posts: 2
Joined: 16 Jan 2020, 05:05
E-book readers owned: Nook
Number of books owned: 0
Country: UK


Post by Krokkie » 16 Jan 2020, 06:20


Some users in the community may be interested in producing OCR'd PDFs. There are already some solutions for this (such as pdfbeads or pdf.py), but how about adding OCR on the fly while converting an existing scan to PDF, or adding OCR to an existing PDF?

Scantools is a set of Linux PDF/A tools with the ability to perform OCR.



Scantools is a high-quality library and a matching set of command line programs for the handling and manipulation of scanned documents. The library is written in C++ and makes heavy use of Qt5.

At present, the library can convert image files to PDF/A. Files in JBIG2, JPEG and JPEG2000 format are included directly in the PDF; other files are compressed losslessly. HOCR files, which are produced by optical character recognition programs such as ‘tesseract’, can be used to make the PDF file searchable. The resulting files comply with the ISO PDF/A standard for long-term archiving of digital documents and offer compression rates comparable to those of the DJVU file format.

There are currently three command line utilities:

image2pdf: converts images to a PDF/A-compliant PDF file.
hocr2any: converts HOCR files to text, or renders them as raster graphics or PDF files.
ocrPDF: adds a text layer to a graphics-only PDF file, without re-encoding graphics data or otherwise modifying file content.
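
For readers who have not looked inside an HOCR file: it is ordinary HTML in which each recognized word sits in a span of class ocrx_word, with its bounding box stored in the title attribute. The Python sketch below is not part of scantools and does not use its API; it only illustrates, with the standard library, the kind of word and position data that a tool needs in order to build a searchable text layer. The sample markup and the names HocrWords, parse_bbox and extract_words are made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample in the style of tesseract's hOCR output.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 600 800'>
<span class='ocr_line' title='bbox 36 92 300 116'>
<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
<span class='ocrx_word' id='word_1_2' title='bbox 110 92 180 116; x_wconf 91'>world</span>
</span></div>"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:5])
    return None


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs from ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Word spans do not contain other spans, so the first closing
        # span seen while inside a word ends that word.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr):
    parser = HocrWords()
    parser.feed(hocr)
    return parser.words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

Joining the extracted words with spaces gives plain text; keeping the bounding boxes is what allows text to be placed invisibly over the matching part of the page image.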

Downloads here

Posts: 1
Joined: 07 Dec 2019, 06:02
E-book readers owned: IRex, Kobo, BeBook, ...
Number of books owned: 0
Country: Netherlands

Re: Scantools for Linux - convert to PDF with OCR

Post by fjkraan » 15 Feb 2020, 17:15

This is very interesting, and I will look into it when I have my D.I.Y. scanner upgrade completed. I have worked with tesseract and hocr before, but what I didn't find was a tool for correcting the OCR text before adding it to the PDF. Editing hocr files directly is possible, but not very convenient. Does scantools have any support for this?


Fred Jan

Posts: 4
Joined: 02 Jun 2020, 13:29
Number of books owned: 0
Country: Rather

Re: Scantools for Linux - convert to PDF with OCR

Post by Noitaenola » 11 Jun 2020, 20:32

I'd also like a tool to easily edit hocr files. I haven't tried this yet, but it seems to ease at least some of the work: PoCoTo: The CIS OCR PostCorrectionTool.
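
Until a proper editor turns up, small fixes can be scripted. The Python sketch below is only an illustration of the idea, not a feature of scantools or PoCoTo: it takes a table of corrections keyed by the word id found in the hocr file and rewrites just the text of those words, leaving the bounding boxes alone. It assumes each word is a plain span of class ocrx_word with an id attribute and no nested tags (words wrapped in strong or em are skipped). The function name apply_corrections and the sample markup are made up for this example.

```python
import re
from html import escape

# A span whose content is plain text (no nested tags).
SPAN_RE = re.compile(r"(<span\b[^>]*>)([^<]*)(</span>)")
ID_RE = re.compile(r"""\bid=['"]([^'"]+)['"]""")


def apply_corrections(hocr, corrections):
    """Replace the text of ocrx_word spans whose id is in corrections."""

    def fix(match):
        open_tag, _old_text, close_tag = match.groups()
        id_match = ID_RE.search(open_tag)
        if (
            "ocrx_word" in open_tag
            and id_match
            and id_match.group(1) in corrections
        ):
            new_text = escape(corrections[id_match.group(1)], quote=False)
            return open_tag + new_text + close_tag
        return match.group(0)  # leave every other span untouched

    return SPAN_RE.sub(fix, hocr)


# Hypothetical one-line sample with an OCR misreading in word_1_1.
SAMPLE = (
    "<span class='ocr_line' title='bbox 36 92 300 116'>"
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 36 92 150 116; x_wconf 62'>tesserract</span> "
    "<span class='ocrx_word' id='word_1_2' "
    "title='bbox 160 92 220 116; x_wconf 93'>works</span>"
    "</span>"
)

if __name__ == "__main__":
    print(apply_corrections(SAMPLE, {"word_1_1": "tesseract"}))
```

The corrected file can then be used in place of the original hocr file when building the PDF. A regular expression is a shortcut here; for files with nested markup, a real HTML parser would be the safer choice.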
