OCR Page Namer

Post by Anonymous1 » 09 Nov 2010, 18:44

After going through a processed 650-page book by hand and renaming every single file manually, I decided to create a little Python script to do it for me automagically.

It only works for books where the page number is printed on the last line at the bottom of the page (and the page must go through Scan Tailor in order for the OCR to work), but I will probably update that in the next version.

To use it, just save the script into the PARENT directory of the files you will be working with and run it from your favorite terminal emulator from inside the directory of the files, not the parent. This is for Linux only. I don't have a Mac, but it should be easy to modify, and I'm too lazy to do this on Windows. It should give you some progress on what it's doing.

It requires Tesseract and Python to work, but I'm not sure if it needs any more things to be installed.
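
A quick way to confirm the external dependency before running anything might look like the sketch below. It assumes Python 3 and only checks that a program called `tesseract` is on the PATH; the helper name `find_missing_tools` is made up for illustration and is not part of the script.

```python
import shutil


def find_missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


# The script shells out to tesseract, cp, mv and rm, so check those.
missing = find_missing_tools(["tesseract", "cp", "mv", "rm"])
if missing:
    print("Missing required program(s): " + ", ".join(missing))
else:
    print("All required programs were found.")
```

If anything is reported missing, install it with your distribution's package manager before running the renamer.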

Here it goes:

Code: Select all

#!/usr/bin/python

import os
import subprocess
from subprocess import Popen
from subprocess import PIPE
import re

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'

    def disable(self):
        self.HEADER = ''
        self.OKBLUE = ''
        self.OKGREEN = ''
        self.WARNING = ''
        self.FAIL = ''
        self.ENDC = ''

def filecount(dir_name):
    try:
        return len([f for f in os.listdir(dir_name) if os.path.isfile(os.path.join(dir_name, f))])
    except Exception:  # the old "except Exception, e" form only parses on Python 2
        return 0

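print('Reading files...')
# Only regular files count as pages; sorting keeps the processing order predictable.
files = sorted(f for f in os.listdir('./') if os.path.isfile(f))
print('There are ' + str(len(files)) + ' pages to process!')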

print('')

print('Creating backup copies of images...')
num_originals = filecount('./')

for item in files:
  # call() waits for each copy to finish, so the later 'mv' cannot run ahead of it
  subprocess.call(['cp', item, item + '.crop.tif'])

print('Creating temporary directory...')
os.system('mkdir temp')

print('')

print('Moving backup files into temporary directory...')
os.system('mv *.crop.tif temp/')

print('')

print('Beginning the OCR process...')

num_new = 0
files_new = []

for image in files:
  # Tesseract appends ".txt" to the output base name, so the text ends up in temp/<image>.txt.txt
  Popen(['tesseract', 'temp/' + image + '.crop.tif', 'temp/' + image + '.txt'], stdout=open(os.devnull, "w"), stderr=subprocess.STDOUT).wait()
  num_new += 1
  
  data_stream = open('temp/' + image + '.txt.txt', 'r')
  lines = "".join(data_stream.readlines())
  data_stream.close()
  
  lines_list = lines.replace('\r', '').rstrip('\n\n').split('\n')
  data = re.sub("[^0-9]", '', lines_list[len(lines_list) - 1])
  
  files_new.append(data)
  
  print('Read ' + bcolors.OKBLUE + str(num_new) + bcolors.ENDC + ' out of ' + bcolors.OKBLUE + str(num_originals) + bcolors.ENDC + ' [' + bcolors.OKBLUE + str(int(100 * num_new / num_originals)) + '%' + bcolors.ENDC + ']: ' + bcolors.OKBLUE + image + bcolors.ENDC + ' => Page ' + bcolors.OKBLUE + str(data) + bcolors.ENDC)

print('')

print('Cleaning up...')
os.system('rm -Rf temp')

print('')

print('Renaming the original files...')

index = 0

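while index < len(files_new):
  target = files_new[index] + ".tif"
  # Skip pages where no number was read, or where the name is already taken,
  # so that no file is overwritten or renamed to just ".tif".
  if files_new[index] == '' or os.path.exists(target):
    print('Skipped ' + bcolors.WARNING + str(files[index]) + bcolors.ENDC + ' (no usable page number)')
  else:
    os.rename(files[index], target)
    print('Renamed ' + bcolors.OKBLUE + str(files[index]) + bcolors.ENDC + ' to ' + bcolors.OKBLUE + str(target) + bcolors.ENDC)
  index += 1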

print('')

print('All done!')
Here's a fake screenshot to make it seem more friendly (I doubt that it will help):

Code: Select all

admin@induction:~/Renamer$ python ../pager.py 
Reading files...
There are 1 pages to process!

Creating backup copies of images...
Creating temporary directory...

Moving backup files into temporary directory...

Beginning the OCR process...
Read 1 out of 1 [100%]: sailfhaoifaoisdghfosdhg.tif => Page 222

Cleaning up...

Renaming the original files...
Renamed sailfhaoifaoisdghfosdhg.tif to 222.tif

All done!
Have fun! ;)

Re: OCR Page Namer

Post by daniel_reetz » 10 Nov 2010, 12:27

So wait a minute... this script reads out the page number from an OCR'ed image file, and renames it according to the page number?

Amazing!

Re: OCR Page Namer

Post by dingodog » 10 Nov 2010, 12:56

It is not such a good idea.

In a book, there may be ROMAN pages (e.g. I, II, III, IV, and so on...), BLANK pages, and NOT NUMBERED PAGES (e.g. INDEX, GLOSSARIES, and so on...)

Re: OCR Page Namer

Post by Misty » 10 Nov 2010, 13:05

There are enough books that follow the appropriate pattern that this could be very useful. Clever idea, Anonymous! And fantastic first post.

I might attempt a Windows port if I end up needing it, though I'm not sure if that will come up.
The opinions expressed in this post are my own and do not necessarily represent those of the Canadian Museum for Human Rights.

Re: OCR Page Namer

Post by univurshul » 10 Nov 2010, 13:18

dingodog wrote: It is not such a good idea... In a book, there may be ROMAN pages (e.g. I, II, III, IV, and so on...), BLANK pages, and NOT NUMBERED PAGES (e.g. INDEX, GLOSSARIES, and so on...)
From a hardware advancement standpoint I think this is a great idea. When we eventually scan and process in realtime, this thing is huge. There was talk about having intelligent software that can tell what page was scanned, and track/check for skips for auto-page turning.
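
The skip check mentioned here could start out as something like the following rough sketch. It assumes the page numbers have already been read as integers, and the function name `find_skipped_pages` is only illustrative.

```python
def find_skipped_pages(page_numbers):
    """Return the page numbers missing between the lowest and highest one seen."""
    seen = set(page_numbers)
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


print(find_skipped_pages([1, 2, 3, 5, 6, 9]))  # → [4, 7, 8]
```

A gap in the result could mean a skipped page turn, but it could just as well be an unnumbered or misread page, so it is a hint to go and look rather than proof.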

Bear in mind it's early. There's a boat-load of variables right now; like a cruise ship getting tugged back to San Diego by the Mexican Navy. But we have a starting point. Let it grow. Roman numeral/numbers can probably be calculated.

Re: OCR Page Namer

Post by Denivic » 10 Nov 2010, 14:18

Excellent idea Anonymous!

I came on this forum today to research OCR software that people are using, but again to my astonishment you guys are on top of everything. I'm having the exact problem Anonymous describes: manually renaming each file. I was sorting through thousands of scanned JPEG images yesterday and I confused myself. Why, you ask? It's because I don't have any type of organization for my scanned documents. So I am curious to know what you guys are doing to organize your data? Is everyone manually changing the file names or is there a batch images editor available for MAC or PC.

Re: OCR Page Namer

Post by Anonymous1 » 10 Nov 2010, 15:10

So I guess people might need this then! I'm working on a GUI version of this which will allow you to basically select a box on a sample image and it uses that to mask the OCR (making it faster).
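
For the masking idea, the box itself is easy to compute. The sketch below is not the planned GUI, just a guess at the arithmetic: it assumes the page number sits in the bottom tenth of the page and returns a (left, upper, right, lower) tuple, the box layout that image libraries such as Pillow's `Image.crop` accept. The name `bottom_strip_box` and the 10% default are assumptions.

```python
def bottom_strip_box(width, height, fraction=0.1):
    """Return a (left, upper, right, lower) box covering the bottom part of a page.

    `fraction` is the share of the page height to keep (an assumed default of 10%).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be greater than 0 and at most 1")
    strip_height = max(1, round(height * fraction))
    return (0, height - strip_height, width, height)


# Example: a 2000 x 3000 pixel scan, keeping only the bottom tenth for OCR.
print(bottom_strip_box(2000, 3000))
```

Running OCR on only that strip should mean far less text for Tesseract to chew through than a full page.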

AFAICT, this should also work for roman numerals with slight modification (just remove the caret in the regex and you're all set to go), as it only reads the last line of text. As for blank pages and unnumbered pages, I named those manually (I bet it would take me much longer to code it rather than to script it). I might fix that in the GUI version with masking.
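
One caveat on the regex: dropping the caret from `[^0-9]` makes it strip the digits and keep every other character on the last line, so Roman numerals probably need a stricter match. Below is a minimal sketch of one way to do it, assuming Python 3 and that the last OCR line holds nothing but the numeral; `ROMAN_PATTERN` and `roman_to_int` are illustrative names, not part of the script above.

```python
import re

ROMAN_VALUES = {'i': 1, 'v': 5, 'x': 10, 'l': 50, 'c': 100, 'd': 500, 'm': 1000}

# Accepts only well-formed numerals (thousands, hundreds, tens, units).
ROMAN_PATTERN = re.compile(
    r'^m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$')


def roman_to_int(text):
    """Return the value of a Roman numeral, or None if `text` is not one."""
    token = text.strip().lower()
    if not token or not ROMAN_PATTERN.match(token):
        return None
    total = 0
    for position, char in enumerate(token):
        value = ROMAN_VALUES[char]
        following = token[position + 1:position + 2]
        # A smaller value before a larger one is subtracted (iv = 4, xc = 90).
        if following and ROMAN_VALUES[following] > value:
            total -= value
        else:
            total += value
    return total


print(roman_to_int('xiv'))    # → 14
print(roman_to_int('Index'))  # → None
```

Front-matter pages could then be named with a prefix so they don't collide with the Arabic page 14, but that naming scheme is a separate decision.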

Re: OCR Page Namer

Post by spamsickle » 10 Nov 2010, 16:17

Denivic wrote: I am curious to know what you guys are doing to organize your data? Is everyone manually changing the file names or is there a batch images editor available for MAC or PC.
When I shoot a book, I create a folder with the name of the book as its title, then within that folder I create L and R folders to hold the images from each camera. A simple script renames the files and sorts them into sequence. I actually copy the files when I do this, because I've run across situations in which it's good to still have L and R images available separately later. This thread talks about a couple of methods for merging and ordering images, which is not quite the same as naming them by page.
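
For anyone who wants a starting point, an L/R merge along these lines might look like the sketch below. This is not the actual script described above; it assumes Python 3, that file names in each folder sort in shooting order, and that the left image comes first in each spread. The names `plan_sequence` and `copy_sequence` are hypothetical.

```python
import os
import shutil


def plan_sequence(left_names, right_names, pad=4):
    """Interleave left and right images; return (side, old_name, new_name) tuples."""
    plan = []
    sequence = 1
    for i in range(max(len(left_names), len(right_names))):
        for side, names in (('L', left_names), ('R', right_names)):
            if i < len(names):
                extension = os.path.splitext(names[i])[1]
                plan.append((side, names[i], str(sequence).zfill(pad) + extension))
                sequence += 1
    return plan


def copy_sequence(book_dir, out_dir):
    """Copy book_dir/L and book_dir/R into out_dir in page order; originals stay put."""
    left = sorted(os.listdir(os.path.join(book_dir, 'L')))
    right = sorted(os.listdir(os.path.join(book_dir, 'R')))
    os.makedirs(out_dir, exist_ok=True)
    for side, old_name, new_name in plan_sequence(left, right):
        shutil.copy2(os.path.join(book_dir, side, old_name),
                     os.path.join(out_dir, new_name))


print(plan_sequence(['a.jpg', 'b.jpg'], ['c.jpg']))
```

Copying rather than moving keeps the separate L and R images available later, as mentioned above.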

I don't know enough about PDF and other book formats, but I assume there is some format which actually uses the name of an individual page for access? So if I'm looking at the table of contents, and I want to go to page 64, I can do that directly by name, rather than getting close and flipping around forward and backward to get it exactly?

Re: OCR Page Namer

Post by univurshul » 13 Nov 2010, 02:09

Denivic wrote:...I am curious to know what you guys are doing to organize your data? Is everyone manually changing the file names or is there a batch images editor available for MAC or PC.
For OSX comprehensive & free GUI book page ordering, see here: http://www.diybookscanner.org/forum/vie ... ?f=3&t=527

Re: OCR Page Namer

Post by dansheffler » 14 Nov 2010, 12:08

I use File Wrangler to rename my pages. It is very easy to use. Here is a quick screen shot of my process:
[Attachment: Picture 1.png]
Notice that I am renaming the files with some text at the front (I usually do author last name and Page), then the sequencer will add numbers to the end of the file. You can step them by 2 if you have all odd pages, begin them on whatever number you want, and add a pad to define a minimum number of digits.

For the most part I don't use two different left and right folders. I begin by simply opening all the odd pages and renaming them with a step of 2 beginning at 1, then I clear all of those and repeat the process: open all the even pages and start at 2. Under some circumstances, though, I have found it useful to keep them separate and then copy them into one folder for Scan Tailor. For instance, I read quite a few books with facing Greek and English, and I only want to do the OCR on the English pages.
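
For those without File Wrangler, the same prefix / start / step / pad numbering is easy to approximate. The sketch below only generates the names, it does not rename anything; the function name `sequence_names`, the `Plato_Page` prefix and the `.tif` extension are made up for the example.

```python
def sequence_names(count, prefix, start=1, step=1, pad=3, ext='.tif'):
    """Build `count` file names: prefix, then a zero-padded number, then extension."""
    return [prefix + str(start + i * step).zfill(pad) + ext for i in range(count)]


# Odd pages: begin at 1 and step by 2.
print(sequence_names(3, 'Plato_Page', start=1, step=2))
# → ['Plato_Page001.tif', 'Plato_Page003.tif', 'Plato_Page005.tif']

# Even pages: begin at 2 and step by 2.
print(sequence_names(3, 'Plato_Page', start=2, step=2))
# → ['Plato_Page002.tif', 'Plato_Page004.tif', 'Plato_Page006.tif']
```

Pairing each generated name with the sorted list of odd or even page files gives the rename plan for that batch.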

I really like your OCR solution though since this could come in handy in many situations.
