I have several million PDF files. I know some of these PDFs have no OCR text layer, and I want a list of the ones that have not been OCR'ed.
I'm trying to find the easiest and fastest way to automatically check all my PDF files and identify the ones I need to send through OCR.
Right now all the PDF files are sitting on four 2 TB drives that hold nothing but PDF files. The OS is Windows and the drives are formatted as NTFS.
Any assistance would be greatly appreciated.
Batch find pdf with no OCR ..
Moderator: peterZ
Re: Batch find pdf with no OCR ..
You might try also posting your query on this dedicated PDF forum in the hope that there may be a way to inspect a file directly to determine if it has a text layer:
http://forums.planetpdf.com
Otherwise, it might be possible to develop a script to automate opening a file, selecting the text if there is any using 'select all' for example, pasting the result into a text editor, and then inspecting the result to determine if there is text present, but I wouldn't be too sure about that...
Edit:
PDF files are in part text files, so in principle it might be possible to automate opening each file in a text editor, and then searching for the presence of a text string that identifies the inclusion of a text layer in the file. But a PDF file containing an image or images may open as a very large text file...
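The raw-file idea can be tried without a text editor. Here is a minimal Python sketch, assuming that a file with a text layer mentions the name `/Font` somewhere in its uncompressed structure. This is only a heuristic: PDFs that keep their objects in compressed object streams can hide the marker, and a scanned PDF may still embed a font for a stamp or header. The function names and the chunk size are illustrative, not taken from any tool.

```python
FONT_MARKER = b"/Font"


def bytes_mention_font(data: bytes) -> bool:
    """Return True if the raw bytes contain the /Font name."""
    return FONT_MARKER in data


def file_mentions_font(path, chunk_size=1 << 20) -> bool:
    """Scan a file in chunks so that large scans are never loaded whole."""
    overlap = len(FONT_MARKER) - 1
    tail = b""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return False
            # Prepend the end of the previous chunk so a marker that
            # straddles a chunk boundary is still found.
            if FONT_MARKER in tail + chunk:
                return True
            tail = (tail + chunk)[-overlap:]
```

Files for which `file_mentions_font` returns False are only candidates for OCR. It would be worth confirming a sample of them with a real text extractor before sending millions of files through OCR.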
Re: Batch find pdf with no OCR ..
[Suggestion withdrawn pending further thought! ]
Re: Batch find pdf with no OCR ..
Another possibility might be to use Adobe Reader’s Edit > Advanced Search option to search for a selected term in all the PDF files in a folder in a single search, searching for a term that would be expected to be present in any file, such as the word ‘and’. Or, possibly in principle even for a character that will always be present in any searchable text and should OCR well, such as the letter ‘o’.
The files shown in the search results should then only include files that contain searchable text, although the output file list displayed would have to be processed manually to separate files containing searchable text from other files.
There are two possible practical complications:
- The search performed on each file includes a search of any metadata (the text displayed in File > Properties... > Description) contained in the file, so that if files that do not have searchable text are to be excluded from the results, the search term used should not be present in any metadata contained in those files. There seems to be no option to exclude file metadata from the search.
- The search results output displays the file ‘Title’ shown in the file metadata rather than the file name, unless the option in Edit > Preferences... > Search - ‘Show document title in search results’ has been unselected.
A possible partial solution, but the practical application would need to be explored further...
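The manual separation step could be reduced to a set difference, provided the search hits can be exported as a plain list of file paths. Below is a minimal Python sketch under that assumption. How the list gets out of Adobe Reader is not covered here, and the function name is illustrative. Every PDF that is not among the hits is a candidate for OCR.

```python
import os


def files_without_hits(all_pdfs, hit_paths):
    """Return the PDFs that never appeared in the search results, in input order."""

    def key(path):
        # normpath tidies separators and "." segments; normcase lower-cases
        # paths on Windows and leaves them unchanged elsewhere.
        return os.path.normcase(os.path.normpath(path))

    hits = {key(p) for p in hit_paths}
    return [p for p in all_pdfs if key(p) not in hits]
```

Note that the metadata complication still applies: a file whose metadata contains the search term would show up as a hit and be wrongly dropped from the candidate list.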
Re: Batch find pdf with no OCR ..
I would use something like xpdf (especially pdftotext) to try to extract text from said PDFs, and if it's empty, then there was no OCR.
As it's a command-line tool, it would be easy to script.
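A rough Python sketch of how that scripting might look, assuming `pdftotext` is on the PATH. The `-f`, `-l` and `-q` options and the `-` output target are standard pdftotext options. The 5-page and 100-character defaults, the timeout and all function names are illustrative assumptions to tune for your own files.

```python
import os
import subprocess


def iter_pdfs(root):
    """Yield every *.pdf path below root, recursing into subfolders."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)


def pdftotext_command(pdf_path, last_page=5):
    """Build the command line; nothing is executed here."""
    # -f/-l limit the page range, -q silences messages, "-" writes to stdout.
    return ["pdftotext", "-q", "-f", "1", "-l", str(last_page), pdf_path, "-"]


def has_enough_text(extracted, min_chars=100):
    """Count non-whitespace characters so blank pages and form feeds are ignored."""
    return sum(1 for ch in extracted if not ch.isspace()) >= min_chars


def needs_ocr(pdf_path, last_page=5, min_chars=100, timeout=60):
    """True if too little text was found, None if the file could not be processed."""
    try:
        result = subprocess.run(
            pdftotext_command(pdf_path, last_page),
            capture_output=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    text = result.stdout.decode("utf-8", errors="replace")
    return not has_enough_text(text, min_chars)
```

Looping over `iter_pdfs` for each drive and writing out every path where `needs_ocr` returns True gives the list of files to OCR. Paths that return None (damaged or encrypted files, timeouts) are worth logging separately rather than silently treating them as scans.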
Re: Batch find pdf with no OCR ..
I had the same problem a year ago, and here is a small script I made. I know this is a multi-programming-language forum, so here is a Groovy script. You need Java and Groovy (http://www.groovy-lang.org/) to run it.
To avoid checking the full PDF document (and to speed up the test), by default it only checks the first 5 pages for text (which means more than 100 characters on those 5 pages). Both parameters can be changed in the script, together with the starting directory.
Frank
Code: Select all
@Grab('com.itextpdf:itextpdf:5.5.5')
@Grab('org.bouncycastle:bcprov-jdk15on:1.49')
@Grab('org.bouncycastle:bcpkix-jdk15on:1.49')
import com.itextpdf.text.pdf.PdfReader
import com.itextpdf.text.pdf.parser.*

// Directory whose PDF files are checked. eachFile only looks at files directly
// inside it; use eachFileRecurse instead to include subfolders.
def startingDir = '.'

println "The following files do not contain text:"
new File(startingDir).eachFile { f ->
    if (f.name.toLowerCase().endsWith('.pdf')
            && !containsText(f.absolutePath)) {
        println f.absolutePath
    }
}

// Returns true if the first maxPagesToRead pages hold more than minTextSize characters.
def containsText(filename, maxPagesToRead = 5, minTextSize = 100) {
    PdfReader reader = new PdfReader(filename)
    try {
        PdfReaderContentParser parser = new PdfReaderContentParser(reader)
        def text = new StringBuilder()
        // Math.min (not Math.max): never read past the last page of a short document
        int lastPage = Math.min(maxPagesToRead, reader.getNumberOfPages())
        for (int i = 1; i <= lastPage; i++) {
            TextExtractionStrategy strategy = parser.processContent(i, new SimpleTextExtractionStrategy())
            text.append(strategy.getResultantText())
            if (text.size() > minTextSize) break  // enough text found, stop early
        }
        return text.size() > minTextSize
    } finally {
        reader.close()  // release the file even if a page fails to parse
    }
}
Re: Batch find pdf with no OCR ..
Have you found a viable solution for this problem yet?
Re: Batch find pdf with no OCR ..
Thanks for the ideas folks.
Will be testing some out in a few days.