It only works for books where the page number is printed on the last line at the bottom of the page (and the page must go through Scan Tailor first for the OCR to work), but I will probably update that in the next version.
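To make that limitation concrete, here is a minimal sketch of the "last line" idea, separate from the script itself. The function name `page_number_from_ocr` and the sample strings are made up for illustration; the sketch simply assumes the OCR text of a page is already available as a string.

```python
import re


def page_number_from_ocr(text):
    """Return the digits on the last non-empty line of OCR text, or '' if none."""
    # Drop carriage returns, then ignore blank or whitespace-only lines
    lines = [line for line in text.replace('\r', '').split('\n') if line.strip()]
    if not lines:
        return ''
    # Keep only the digits of the last remaining line
    return re.sub('[^0-9]', '', lines[-1])


print(page_number_from_ocr('Some body text\nmore text\n\n  - 222 -\n\n'))  # prints 222
print(repr(page_number_from_ocr('Chapter 3\nNo number here\n')))          # prints ''
```

If the number sits in a header, or the last line is a footnote that happens to contain digits, this approach returns the wrong value or nothing at all, which is why the layout restriction matters.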
To use it, save the script into the PARENT directory of the files you will be working with, then run it from your favorite terminal emulator from inside the directory that holds the files, not from the parent. This is for Linux only. I don't have a Mac, but it should be easy to modify for one, and I'm too lazy to do this on Windows. It prints progress messages as it works.
It requires Tesseract and Python to work (plus the standard cp, mv, mkdir and rm commands, which it calls), but I'm not sure if it needs anything else to be installed.
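If you are not sure whether everything is installed, a quick check along these lines may help. This is only a sketch: `missing_tools` is a hypothetical helper, and it assumes Python 3.3 or newer, since it relies on `shutil.which`.

```python
import shutil


def missing_tools(names):
    """Return the entries of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# External commands the renaming script calls
required = ['tesseract', 'cp', 'mv', 'mkdir', 'rm']
missing = missing_tools(required)
if missing:
    print('Missing: ' + ', '.join(missing))
else:
    print('All required tools were found.')
```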
Here it goes:
Code: Select all
#!/usr/bin/python
import os
import re
import subprocess


class bcolors:
    # ANSI escape codes for colored terminal output
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'

    def disable(self):
        self.HEADER = ''
        self.OKBLUE = ''
        self.OKGREEN = ''
        self.WARNING = ''
        self.FAIL = ''
        self.ENDC = ''


def filecount(dir_name):
    # Count the regular files (not directories) in dir_name
    try:
        return len([f for f in os.listdir(dir_name) if os.path.isfile(os.path.join(dir_name, f))])
    except Exception:
        return 0


print('Reading files...')
files = os.listdir('./')
print('There are ' + str(len(files)) + ' pages to process!')
print('')
print('Creating backup copies of images...')
num_originals = filecount('./')
for item in files:
    # call() waits for cp to finish, so every copy exists before the move below
    subprocess.call(['cp', item, item + '.crop.tif'])
print('Creating temporary directory...')
os.system('mkdir temp')
print('')
print('Moving backup files into temporary directory...')
os.system('mv *.crop.tif temp/')
print('')
print('Beginning the OCR process...')
num_new = 0
files_new = []
for image in files:
    # Tesseract appends '.txt' to the output base, hence the '.txt.txt' name read below
    with open(os.devnull, 'w') as devnull:
        subprocess.call(['tesseract', 'temp/' + image + '.crop.tif', 'temp/' + image + '.txt'], stdout=devnull, stderr=subprocess.STDOUT)
    num_new += 1
    with open('temp/' + image + '.txt.txt', 'r') as data_stream:
        lines = data_stream.read()
    # Strip trailing whitespace (including the form feed some Tesseract versions append)
    lines_list = lines.replace('\r', '').rstrip().split('\n')
    # Keep only the digits from the last line of the OCR output
    data = re.sub("[^0-9]", '', lines_list[-1])
    files_new.append(data)
    print('Read ' + bcolors.OKBLUE + str(num_new) + bcolors.ENDC + ' out of ' + bcolors.OKBLUE + str(num_originals) + bcolors.ENDC + ' [' + bcolors.OKBLUE + str(int(100 * num_new / num_originals)) + '%' + bcolors.ENDC + ']: ' + bcolors.OKBLUE + image + bcolors.ENDC + ' => Page ' + bcolors.OKBLUE + str(data) + bcolors.ENDC)
print('')
print('Cleaning up...')
os.system('rm -Rf temp')
print('')
print('Renaming the original files...')
index = 0
while index < len(files_new):
    if files_new[index] == '':
        # No digits were read; renaming would produce a file called just '.tif'
        print(bcolors.WARNING + 'Skipped ' + str(files[index]) + ' (no page number found)' + bcolors.ENDC)
    else:
        subprocess.call(['mv', files[index], files_new[index] + ".tif"])
        print('Renamed ' + bcolors.OKBLUE + str(files[index]) + bcolors.ENDC + ' to ' + bcolors.OKBLUE + str(files_new[index] + ".tif") + bcolors.ENDC)
    index += 1
print('')
print('All done!')
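One thing to watch out for: OCR can misread a page, returning no digits or a number already used by another page, and mv will overwrite an existing target without asking. Below is a minimal sketch of how the renames could be checked before any file is touched. `plan_renames` is a hypothetical helper, not part of the script above, and the file names are invented.

```python
def plan_renames(files, pages):
    """Pair each file with a target name; skip empty or repeated page numbers."""
    plan, skipped, seen = [], [], set()
    for name, page in zip(files, pages):
        if page == '' or page in seen:
            # Nothing usable was read, or another file already claimed this number
            skipped.append(name)
            continue
        seen.add(page)
        plan.append((name, page + '.tif'))
    return plan, skipped


plan, skipped = plan_renames(['a.tif', 'b.tif', 'c.tif', 'd.tif'],
                             ['221', '', '222', '221'])
print(plan)     # [('a.tif', '221.tif'), ('c.tif', '222.tif')]
print(skipped)  # ['b.tif', 'd.tif']
```

The skipped files keep their original names, so they can be fixed by hand afterwards.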
Code: Select all
admin@induction:~/Renamer$ python ../pager.py
Reading files...
There are 1 pages to process!
Creating backup copies of images...
Creating temporary directory...
Moving backup files into temporary directory...
Beginning the OCR process...
Read 1 out of 1 [100%]: sailfhaoifaoisdghfosdhg.tif => Page 222
Cleaning up...
Renaming the original files...
Renamed sailfhaoifaoisdghfosdhg.tif to 222.tif
All done!