pdf library search optimization


Moderator: peterZ

revwarguy
Posts: 12
Joined: 21 Feb 2012, 21:57
E-book readers owned: samsung 10.1 tablet
Number of books owned: 3000

pdf library search optimization

Post by revwarguy »

I am about to complete my scanner, and soon I hope to have a large collection of pdf files, some of which are related, some of which are not. I plan on organizing related pdfs using a typical directory structure.

I will also be using ABBYY Finereader to create the pdfs. The base documents need to keep high res images in them.

I am hoping those who are further down the road than I am can offer some advice about:

1. optimizing individual pdfs for search speed - when does splitting off the text become worth it overall? Are there external indexing tools? (I am not talking about the Acrobat indexing here, but some automatic way of pointing to a directory of pdfs, creating a text index for that directory, and then searching that index.)

2. optimizing any architectural factors for search speed, like:
a. if a group of directories is going to be searched frequently, how much overall efficiency is gained by copying the PDFs to an SSD first?
b. what can be done to take advantage of multi-core processors?
c. is there any penalty for searching multiple directories rather than a single directory?
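
To make the idea in point 1 concrete, here is a minimal sketch of "point at a directory, build a text index, then search the index". It assumes the OCR text has already been exported to plain .txt files stored next to the pdfs (an assumption, not something stated in this thread), and the names build_index and search are made up for illustration. It is not how Acrobat builds its index.

```python
import re
from collections import defaultdict
from pathlib import Path

WORD = re.compile(r"[a-z0-9]+")

def build_index(root):
    """Map each lowercase word to the set of text files under root containing it."""
    index = defaultdict(set)
    # Assumption: every pdf has a .txt export of its OCR text somewhere under root.
    for txt in Path(root).rglob("*.txt"):
        text = txt.read_text(encoding="utf-8", errors="ignore").lower()
        for word in set(WORD.findall(text)):
            index[word].add(str(txt))
    return index

def search(index, query):
    """Return the sorted files that contain every word in the query."""
    terms = WORD.findall(query.lower())
    if not terms:
        return []
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return sorted(hits)
```

Because rglob walks subdirectories, one call covers a whole directory tree, and a search only touches the in-memory index rather than the high-res pdfs themselves.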

I am just getting started down this road, so any tips you can offer toward this end would be greatly appreciated.

TIA,
BruceG
Posts: 99
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: pdf library search optimization

Post by BruceG »

I do not know of any indexing methods other than Acrobat indexing, which I use.

As for speed, I just checked SSD vs HD on 107 years of magazines. For testing I searched on 'China': SSD 5 sec, HD 5+ sec. Increasing the number of documents to 188 (books and other years of magazines) increased the search time to 7 and 8 sec. The processor has 6 cores (5820K); CPU usage went up 7% and memory stayed the same. Creating the index would be more taxing on the computer.
When I was looking, it was important that anyone could search the index of the documents on their own computer with free software, in my case Acrobat Reader.

I index groups of books or magazine titles separately and then index the whole collection.
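
The group-then-collection layout described above can be sketched roughly as follows. This is only an illustration of the idea, not Acrobat's index format: each index is assumed to be a plain mapping from word to a set of document names, and merge_indexes, the group names and the file names are all hypothetical.

```python
from collections import defaultdict

def merge_indexes(group_indexes):
    """Combine per-group indexes (word -> set of documents) into one collection index."""
    merged = defaultdict(set)
    for group_name, index in group_indexes.items():
        for word, docs in index.items():
            # Prefix each document with its group so the source stays visible.
            merged[word].update(f"{group_name}/{doc}" for doc in docs)
    return dict(merged)

# Hypothetical per-group indexes, one per magazine title or book group.
groups = {
    "magazines": {"china": {"1910-03.pdf"}, "trade": {"1910-03.pdf", "1911-07.pdf"}},
    "books": {"china": {"travels.pdf"}},
}
collection = merge_indexes(groups)
print(sorted(collection["china"]))  # ['books/travels.pdf', 'magazines/1910-03.pdf']
```

Keeping the group indexes around means a search can be limited to one title when that is all that is needed, while the merged index covers the whole collection in one lookup.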