Cataloging exhibition material - best way? (method/software)

Posted: 04 May 2016, 22:37
by yello
Hope I can collect some ideas here. I often go to exhibitions and collect tons of catalogs, leaflets etc., which of course you can't find six months later in the meter-high stack of paper. I'd like to make that searchable.

How to scan them is clear. But what next?

I have enough HDD storage. On top of the scan I can OCR everything that is important.

My idea is:

1. Put scans in one folder (or several for different topics)
2. Put the OCR (text only) in a MySQL DB
3. Link the text with an image (collect keywords too)

In that case I should be able to do fast searches and get the image if needed.

It would be easy for me to make that browser-accessible: write it in PHP with MySQL as the DB.
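A minimal sketch of the three steps above, not a finished design. To keep it runnable without a server, SQLite from the Python standard library stands in for MySQL, and the table names (`scans`, `keywords`), the helper functions and the sample file paths are all made up for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; table and column names are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scans (
    id        INTEGER PRIMARY KEY,
    file_path TEXT NOT NULL,   -- where the scanned image lives on disk
    ocr_text  TEXT NOT NULL    -- plain OCR text, no formatting
);
CREATE TABLE keywords (
    scan_id INTEGER NOT NULL REFERENCES scans(id),
    keyword TEXT NOT NULL
);
""")

def add_scan(file_path, ocr_text, keywords):
    """Store one scan with its OCR text and keywords; return the new row id."""
    cur = conn.execute(
        "INSERT INTO scans (file_path, ocr_text) VALUES (?, ?)",
        (file_path, ocr_text),
    )
    scan_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO keywords (scan_id, keyword) VALUES (?, ?)",
        [(scan_id, kw.lower()) for kw in keywords],
    )
    return scan_id

def find_by_keyword(keyword):
    """Return the file paths of all scans tagged with the keyword."""
    rows = conn.execute(
        "SELECT s.file_path FROM scans s "
        "JOIN keywords k ON k.scan_id = s.id "
        "WHERE k.keyword = ? ORDER BY s.file_path",
        (keyword.lower(),),
    )
    return [row[0] for row in rows]

# Invented sample data
add_scan("scans/tools/drill_leaflet.png", "Cordless drill, 18V ...", ["drill", "cordless"])
add_scan("scans/tools/saw_catalog.png", "Circular saws ...", ["saw"])
print(find_by_keyword("Drill"))
```

Keeping keywords in their own table means one scan can carry any number of them, and the image itself never goes into the database, only its path.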

Is anybody here doing something similar? How do you do it?

Re: Cataloging exhibition material - best way? (method/software)

Posted: 13 May 2016, 02:26
by peterZ
You could try a Fujitsu ScanSnap iX500 and then some indexing and search software.

Re: Cataloging exhibition material - best way? (method/software)

Posted: 13 May 2016, 04:11
by BruceG
yello
If you happen to have Acrobat you could save all the material, with appropriate naming, as PDFs.
Then create an index/catalog of all the material with Acrobat; this can be added to with each exhibition. Searching is very quick. It picks up any word as long as the doc has been OCRed.
I use this method for magazines, some of which have been going for 100+ years. You could do every PDF file you have.
When searching, the document comes up with every page on which the word is mentioned. For magazines I create a file for each year; you could do this for each exhibition.
Trust this makes sense.

Re: Cataloging exhibition material - best way? (method/software)

Posted: 01 Dec 2017, 20:49
by yello
Well, I didn't come back for quite a while. My problem is still there. I have a new plan though.

I bought a book scanner (Czur ET16) and am waiting for delivery.

Here is what I think I can do:

• Scan and OCR the catalogs (ET16 uses ABBYY FineReader)
• Save as readable PDF in a folder

Here the ET16 ends...

Then write a PHP page that:

• Extracts the text from the PDF
• Puts the text (without formatting) in a MySQL DB
• Links the text in the DB to that PDF
• Offers a form for entering, editing etc.
• Has several fields for adding keywords, source, company name, notes etc.
• Provides a search page

Now I can do a keyword search or a full-text search, find the catalog or page I want and bring up the PDF.
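A rough sketch of the "find the catalog or page" part, under a few assumptions: the text is stored one row per PDF page, SQLite from the Python standard library stands in for MySQL, and the `pages` table, the function names and the sample data are invented. Extracting the text from the PDF is left out; the page texts are simply passed in as a list.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the `pages` table is invented.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE pages (
    pdf_path TEXT    NOT NULL,   -- path of the searchable PDF on disk
    page_no  INTEGER NOT NULL,   -- 1-based page number inside that PDF
    body     TEXT    NOT NULL,   -- extracted text of that page, no formatting
    PRIMARY KEY (pdf_path, page_no)
)""")

def store_pdf_text(pdf_path, page_texts):
    """Insert one row per page; page_texts is a list ordered by page."""
    conn.executemany(
        "INSERT INTO pages (pdf_path, page_no, body) VALUES (?, ?, ?)",
        [(pdf_path, n, text) for n, text in enumerate(page_texts, start=1)],
    )

def search(term):
    """Return (pdf_path, page_no) for every page whose text contains term."""
    # Escape LIKE wildcards so the term is matched literally.
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT pdf_path, page_no FROM pages "
        "WHERE body LIKE ? ESCAPE '\\' ORDER BY pdf_path, page_no",
        ("%" + escaped + "%",),
    )
    return list(rows)

# Invented sample data
store_pdf_text("catalogs/acme_2017.pdf",
               ["Pumps and valves", "Hydraulic pumps", "Contact details"])
print(search("pumps"))
```

A plain `LIKE` scan reads every row, which is fine for a sketch; on a real MySQL table a FULLTEXT index would be the more likely choice once the collection grows.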

How does that sound?

Re: Cataloging exhibition material - best way? (method/software)

Posted: 03 Dec 2017, 18:27
by BruceG
What you are trying to do is what Adobe Acrobat does already although at a great cost. It is the only program that indexes pdf files that I have found. Writing such a program is beyond me.
You can get a trial version for 30 days I think. So once you have scanned all your material, give it a try. As you would not have to index often, someone with Acrobat may do it for you.

I also have an ET16. The new model is better at scanning shiny paper as there are lights lower down; well, that is what I have read.

A PDF editor may also be helpful to make sure certain words are spelt right (keywords, source, company name, notes etc.). Searching will otherwise be difficult.

Re: Cataloging exhibition material - best way? (method/software)

Posted: 11 Dec 2017, 20:26
by yello
@BruceG
Well, Acrobat costs something like US$100 - per year! If I can do that DIY, I'll try that first.

So what happened in the last few days:
  • I got the Czur ET16 from China
  • China models work only inside China (doh!) - don't buy in China!
  • I could make it work with an older software version
I have the normal model, not the PLUS one with additional lights. It scans OK as long as the paper isn't shiny. My targets are catalogs and brochures, which are very often shiny, so my next step is to build some additional lighting to fix that.

The scans are 'OK', they are of course far worse than flatbed scans.

I can't say much about the OCR accuracy, only this: it's not that relevant for me. I plan to grab the text and store the full text in a MySQL DB. There I can do a full-text search very fast and, if I want, bring up the original scan as a readable PDF. I don't plan to search through the searchable PDFs directly on the PC; that might be slow. MySQL answers in a fraction of a second.
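To illustrate why an indexed search can be so much faster than opening every PDF: full-text indexes are typically built as inverted indexes, where each word points to the documents that contain it. The toy version below is only a sketch of that idea in Python, not how MySQL is implemented internally; the function names and sample texts are invented.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """docs maps pdf_path -> full text. Returns word -> set of pdf paths."""
    index = defaultdict(set)
    for path, text in docs.items():
        for word in tokenize(text):
            index[word].add(path)
    return index

def search(index, query):
    """Return the sorted paths of documents containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    # One dictionary lookup per word, then a set intersection;
    # the document texts themselves are never read again.
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

# Invented sample texts
docs = {
    "catalogs/a.pdf": "Compact SUV with diesel engine",
    "catalogs/b.pdf": "Electric city car",
    "catalogs/c.pdf": "Large SUV, electric drive",
}
index = build_index(docs)
print(search(index, "electric suv"))
```

The work of reading all the text happens once, when the index is built; each later search only touches the entries for the words in the query.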

In addition to the full text in the DB I plan to add relevant keywords. For example, if I had a few thousand car catalogs I would add: BMW, SUV - or something like that.

The actual grabbing and storing in MySQL I plan to do with PDFParser (pdfparser.org, on GitHub). It works great on some PDFs; however, Czur's ABBYY software saves PDFs in version 1.5, which PDFParser can't handle. The easy way out is to bundle it with GitHub's xthiago/pdf-version-converter and bring the files down to version 1.4.
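A small sketch of how one might decide which files need the conversion at all. A PDF file starts with a header line such as `%PDF-1.5`, so reading the first bytes is usually enough. This is written in Python for illustration, not with the PHP libraries named above; the function names and the 1.4 threshold parameter are assumptions for this example. Note that a PDF can also declare a newer version inside the document itself, which this check ignores.

```python
import os
import re
import tempfile

def pdf_header_version(path):
    """Return (major, minor) from the %PDF-x.y header, or None if not found."""
    with open(path, "rb") as f:
        head = f.read(1024)  # the header sits at (or very near) the file start
    match = re.search(rb"%PDF-(\d)\.(\d)", head)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def needs_downgrade(path, max_version=(1, 4)):
    """True if the header version is newer than the parser is assumed to handle."""
    version = pdf_header_version(path)
    return version is not None and version > max_version

# Demo on a throwaway file that only carries a PDF 1.5 header line
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.5\n")
demo_version = pdf_header_version(tmp.name)
demo_flag = needs_downgrade(tmp.name)
os.remove(tmp.name)
print(demo_version, demo_flag)
```

Files that already report 1.4 or lower could then skip the converter and go straight to text extraction.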

I have the PDFParser and pdf-version-converter already working separately. To put them together into one is easy. To write the PHP for putting text in MySQL and design a user interface page is rather easy too.

Would I buy the Czur ET16 again? Mhm, certainly not in China. I might go for the PLUS model next time. The shiny paper issue makes the normal model unusable on glossy paper.

Re: Cataloging exhibition material - best way? (method/software)

Posted: 12 Dec 2017, 04:33
by BruceG
Yes, Acrobat costs a bit. I use Acrobat 9, which does not have an annual fee. You can pay for the new version one month at a time or yearly; the old one still does what I need.
I index sets of material separately, i.e. magazines and book sets, and also all together (as most of what I do has the same subject), and usually search on people and locations. It is fast, and the page the word or phrase is on comes up. What took weeks of reading can now be done in seconds. The index is searched, not the PDFs.

As for the ET16, I got mine from China. To use the up-to-date software you need the up-to-date firmware, or it will not work.
Most of my material is done on a flatbed scanner. Recently I did some minute books from the 1800s that were falling apart; the ET16 was good for that in that further damage was limited.

Shiny paper is difficult. Turning the lights off and use another light source lower down is my suggestion. Or light through a window.

Re: Cataloging exhibition material - best way? (method/software)

Posted: 14 Dec 2017, 15:33
by cday
BruceG wrote: 12 Dec 2017, 04:33
Shiny paper is difficult. Turning the lights off and use another light source lower down is my suggestion. Or light through a window.
Could a polarization filter, or possibly two filter sheets at 90 degrees with one in front of the light source, possibly provide a solution?

For example: Polarizing filter (photography)
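For reference, the idea behind two filters at 90 degrees follows Malus's law: polarized light passing a second polarizer is reduced by cos² of the angle between the two. Glare reflected off a glossy surface tends to keep the polarization of the light source, while light scattered by the paper and ink is largely depolarized, so a crossed filter in front of the camera mostly removes the glare. The snippet below only evaluates the formula for ideal filters; real filter sheets leak some light, and the function name is made up.

```python
import math

def transmitted_fraction(angle_deg):
    """Malus's law for ideal polarizers: fraction of already-polarized light
    that passes a second polarizer rotated by angle_deg against the first."""
    return math.cos(math.radians(angle_deg)) ** 2

for angle in (0, 45, 90):
    print(angle, round(transmitted_fraction(angle), 3))
```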

Re: Cataloging exhibition material - best way? (method/software)

Posted: 27 Dec 2017, 01:52
by yello
cday wrote: 14 Dec 2017, 15:33
BruceG wrote: 12 Dec 2017, 04:33
Shiny paper is difficult. Turning the lights off and use another light source lower down is my suggestion. Or light through a window.
Could a polarization filter, or possibly two filter sheets at 90 degrees with one in front of the light source, possibly provide a solution?

For example: Polarizing filter (photography)
I have some polarized glasses from the movies and can give it a try, but I don't really think it will work.

What I did in the meantime: I took out the LED bars from my photo-box and used them. They are much brighter than the original ET16 LEDs.

Good I thought... However, the scanner projects laser lines on the paper and uses the reflection to calculate the curve and correct it somehow. With too much light it can't 'see' the lines. Plus, the fixing of the photo-box panels (50 x 400mm @ 30 LEDs) is a bit tricky.

Hence what I will try now: I bought some 12V LED strips (single color, 6000K) that can be cut every 5cm. I will cut them into 20cm strips and use one (or more) on each side. I also have a Meanwell 12V LED driver (they don't flicker) and a dimmer (which I hope doesn't flicker).

I also put a light-measuring app on my phone and will set the new light intensity to match the installed light.

Will update soon.