Batch find pdf with no OCR ..

Moderator: peterZ

InfoMania
Posts: 2
Joined: 22 Mar 2015, 18:32
Number of books owned: 0
Country: China

Batch find pdf with no OCR ..

Post by InfoMania »

I have several million PDF files. I know some of these PDFs have no OCR text layer, and I want to obtain a list of the ones that have not been OCR'ed.

I'm trying to find the easiest and fastest way to automatically check all my PDF files and identify the ones I need to send through OCR.

Right now all the PDF files are sitting on four 2 TB drives that hold nothing but PDF files. The OS is Windows and the drives are formatted as NTFS.

Any assistance would be greatly appreciated.
cday
Posts: 451
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Batch find pdf with no OCR ..

Post by cday »

You might also try posting your query on this dedicated PDF forum, in the hope that there is a way to inspect a file directly to determine whether it has a text layer:

http://forums.planetpdf.com

Otherwise, it might be possible to develop a script that opens each file, selects any text it contains (using 'select all', for example), pastes the result into a text editor, and then inspects the result to determine whether text is present. I wouldn't be too sure about that, though... ;)

Edit:

PDF files are in part text files, so in principle it might be possible to automate opening each file in a text editor, and then searching for the presence of a text string that identifies the inclusion of a text layer in the file. But a PDF file containing an image or images may open as a very large text file...
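
A rough sketch of that idea in Python, under a loudly stated assumption: pages that carry text normally reference a font, so the byte string `/Font` usually appears somewhere in the file. This is only a heuristic. Newer PDFs can keep their dictionaries inside compressed object streams, where the marker is hidden, and an image-only scan can still embed a font it never uses. Reading the file as bytes in chunks sidesteps the "very large text file" problem. The function name `looks_like_it_has_fonts` is made up for this example.

```python
MARKER = b"/Font"  # assumption: text-bearing pages reference a font resource


def looks_like_it_has_fonts(path, chunk_size=1 << 20):
    """Return True if the raw bytes of the file contain MARKER.

    Heuristic only: the marker can be hidden inside compressed object
    streams, so False means "candidate for OCR", not proof of no text.
    """
    overlap = len(MARKER) - 1
    tail = b""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return False
            # Prepend the end of the previous data so a marker that
            # straddles a chunk boundary is still found.
            window = tail + chunk
            if MARKER in window:
                return True
            tail = window[-overlap:]
```

Files for which this returns False would still need a second look with a real text extractor before being queued for OCR.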
cday
Posts: 451
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Batch find pdf with no OCR ..

Post by cday »

[Suggestion withdrawn pending further thought!]
cday
Posts: 451
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: Batch find pdf with no OCR ..

Post by cday »

Another possibility might be to use Adobe Reader’s Edit > Advanced Search option to search all the PDF files in a folder in a single pass, using a term that would be expected to be present in any file, such as the word ‘and’. In principle it could even be a single character that will always be present in any searchable text and should OCR well, such as the letter ‘o’.

The files shown in the search results should then only include files that contain searchable text, although the output file list displayed would have to be processed manually to separate files containing searchable text from other files.

There are two possible practical complications:
  • The search performed on each file also covers any metadata (the text displayed in File > Properties... > Description) contained in the file. If files without searchable text are to be excluded from the results, the search term must not appear in the metadata of those files. There seems to be no option to exclude file metadata from the search. :?:

  • The search results display the file ‘Title’ from the file metadata rather than the file name, unless the option ‘Show document title in search results’ in Edit > Preferences... > Search has been unselected. In addition, if the full file path is to be displayed in the results, the ‘Collapse file paths’ option in the search output should be unselected.

A possible partial solution, but the practical application would need to be explored further...
adong
Posts: 2
Joined: 07 Nov 2013, 17:37
Number of books owned: 0
Country: France

Re: Batch find pdf with no OCR ..

Post by adong »

I would use something like xpdf (especially pdftotext) to try to extract text from said PDFs, and if the output is empty, then there was no OCR.
As it's a command-line tool, it would be easy to script :)
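
A minimal sketch of how that could be scripted, in Python. It assumes a `pdftotext` binary (from xpdf or Poppler) is on the PATH; `-l 5` limits extraction to the first five pages and `-` sends the text to stdout. The function names, the timeout, and the 100-character threshold are illustrative choices, not anything pdftotext prescribes.

```python
import subprocess


def extract_text(path, last_page=5, timeout=60):
    """Run pdftotext on the first pages and return whatever text it finds.

    Returns '' if pdftotext exits with an error or takes too long, so
    corrupt or encrypted files end up flagged and should be reviewed.
    """
    try:
        result = subprocess.run(
            ["pdftotext", "-l", str(last_page), str(path), "-"],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return ""
    if result.returncode != 0:
        return ""
    return result.stdout.decode("utf-8", errors="replace")


def needs_ocr(path, min_chars=100, extract=extract_text):
    """True if fewer than min_chars non-whitespace characters were extracted."""
    text = extract(path)
    # Drop spaces, newlines and the form feeds pdftotext puts between pages.
    visible = "".join(text.split())
    return len(visible) < min_chars
```

Calling `needs_ocr(r"D:\scans\example.pdf")` (a made-up path) would then return True for a file with no usable text layer. Using a threshold rather than testing for a completely empty result guards against scans that contain only a stray page number or watermark.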
mera461
Posts: 7
Joined: 27 Dec 2013, 07:08
Number of books owned: 0
Country: Denmark

Re: Batch find pdf with no OCR ..

Post by mera461 »

I had the same problem a year ago, and here is a small script I made. I know this is a multi-programming-language forum, so here is a Groovy script :-) You need Java and Groovy (http://www.groovy-lang.org/) to run it.

To avoid checking the full PDF document (and to speed up the test), by default it only checks the first 5 pages for text, and a file counts as containing text only if more than 100 characters are found on those pages. Both parameters can be changed in the script, together with the starting directory.

Code:

@Grab('com.itextpdf:itextpdf:5.5.5')
@Grab('org.bouncycastle:bcprov-jdk15on:1.49')
@Grab('org.bouncycastle:bcpkix-jdk15on:1.49')

import com.itextpdf.text.pdf.PdfReader
import com.itextpdf.text.pdf.parser.*

// Directory to scan. eachFile does not descend into subfolders;
// eachFileRecurse can be used instead to cover a whole tree.
def startingDir = '.'
println "The following files do not contain text:"
new File(startingDir).eachFile { f ->
	if (f.name.toLowerCase().endsWith('.pdf')
		&& ! containsText(f.absolutePath)) {
		println f.absolutePath
	}
}

// Returns true once more than minTextSize characters have been
// extracted from the first maxPagesToRead pages.
def containsText(filename, maxPagesToRead=5, minTextSize=100) {
	PdfReader reader = new PdfReader(filename)
	PdfReaderContentParser parser = new PdfReaderContentParser(reader)
	def text = new StringBuilder()
	// Stop at the page limit or the last page, whichever comes first
	int lastPage = Math.min(maxPagesToRead, reader.getNumberOfPages())
	for (int i = 1; i <= lastPage; i++) {
		def strategy = parser.processContent(i, new SimpleTextExtractionStrategy())
		text.append(strategy.getResultantText())
		if (text.size() > minTextSize) break
	}
	reader.close()
	return text.size() > minTextSize
}
Frank
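
For a collection spread over several drives, a per-file check like the one above has to be applied recursively, and the results are better written to a file than to the console. A hedged Python sketch of that outer loop follows. The `is_missing_text` callback is a placeholder for whatever per-file test is chosen, and the report layout is an arbitrary choice for illustration. Files that raise an error are logged rather than stopping the run.

```python
import os


def find_pdfs(roots):
    """Yield the path of every .pdf file under the given root folders."""
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if name.lower().endswith(".pdf"):
                    yield os.path.join(dirpath, name)


def write_report(roots, is_missing_text, report_path):
    """Write one line per PDF that fails the check; return (checked, flagged)."""
    checked = flagged = 0
    with open(report_path, "w", encoding="utf-8",
              errors="backslashreplace") as report:
        for path in find_pdfs(roots):
            checked += 1
            try:
                missing = is_missing_text(path)
            except Exception as exc:  # unreadable file: record it and carry on
                report.write(f"ERROR\t{path}\t{exc}\n")
                continue
            if missing:
                flagged += 1
                report.write(f"NO_TEXT\t{path}\n")
    return checked, flagged
```

With hypothetical drive letters, a call might look like `write_report(["D:\\", "E:\\", "F:\\", "G:\\"], my_check, "no_ocr_report.txt")`, where `my_check` is any function that takes a path and returns True when the file lacks text. Because the report is written as the walk proceeds, an interrupted run over millions of files still leaves a usable partial list.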
myfreeocr
Posts: 4
Joined: 02 Apr 2015, 06:26
Number of books owned: 0
Country: India

Re: Batch find pdf with no OCR ..

Post by myfreeocr »

Have you found a viable solution to this problem yet?
InfoMania
Posts: 2
Joined: 22 Mar 2015, 18:32
Number of books owned: 0
Country: China

Re: Batch find pdf with no OCR ..

Post by InfoMania »

Thanks for the ideas folks.

Will be testing some of them out in a few days.