
How to convert a book to searchable PDF using open source software

zbgns
Posts: 52
Joined: 22 Dec 2016, 06:07
E-book readers owned: Tolino, Kindle
Number of books owned: 600
Country: Poland

Re: How to convert a book to searchable PDF using open source software

Post by zbgns » 09 Jan 2020, 21:17

As you already wrote, it was a sorting problem caused by inconsistent file naming. BTW, this is not a Tesseract issue: Tesseract cannot process a batch of separate files directly, so a workaround is needed, namely a list of files in the right order for Tesseract to follow. The list was created with the 'ls' command, which lists all 'tif' files in the working directory in name order, and its output was saved to a text file ('output.txt' in this case). You could potentially avoid renaming the files by changing the sort order, e.g. sorting by creation date (assuming the files were created in the correct sequence). Tesseract relies fully on this list and processes the files one by one in the given order.
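To illustrate the sort-order idea, here is a rough sketch, not a tested part of my workflow. The function names, the 'book' output name and the 'eng' language are made up for the example. Note also that plain 'ls' sorts by modification time rather than true creation time, so this assumes the files were not touched after scanning.

```shell
#!/bin/sh
# Write the names of all .tif files in the current directory to a list file,
# oldest modification time first (-t sorts by time, -r reverses to oldest first).
# Assumptions: the files were saved in page order and were not modified later,
# and the file names contain no newline characters.
build_list_by_mtime() {
    ls -tr -- *.tif > "$1"
}

# Alternative: natural ordering of numbers in names (page2 before page10).
# Assumption: GNU ls, since -v is not a POSIX option.
build_list_by_version() {
    ls -v -- *.tif > "$1"
}

# Example run (commented out so the functions can be tried on their own):
# build_list_by_mtime output.txt
# tesseract output.txt book -l eng pdf
```

It is worth opening 'output.txt' and checking the order before starting Tesseract, because OCR of a whole book takes a while and a wrong order only shows up in the finished PDF.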
Please also note that the scripts in this thread are very basic and prone to problems like this. They definitely do not follow good shell-scripting practices, and I presented them not as a comprehensive working solution but as examples of my own approach. A more sophisticated implementation would be needed before offering this to others as a kind of 'universal' software.
