Cataloging exhibition material - best way? (method/software)

yello
Posts: 5
Joined: 02 Jan 2016, 23:35
E-book readers owned: Sony
Number of books owned: 30
Country: Hong Kong SAR China

Post by yello » 04 May 2016, 22:37

I hope I can collect some ideas here. I often go to exhibitions and collect tons of catalogs, leaflets, etc., which of course you can't find six months later in the meter-high stack of paper. I would like to make that searchable.

How to scan them is clear. But what next?

I have enough HDD storage. On top of the scan I can OCR everything that is important.

My idea is:

1. Put scans in one folder (or several for different topics)
2. Put the OCR (text only) in a MySQL DB
3. Link the text with an image (collect keywords too)

In that case I should be able to do fast searches and get the image if needed.

It would be easy for me to make this browser-accessible, written in PHP with MySQL as the DB.
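The three steps above could be sketched roughly like this. It is only an illustration of the idea, written in Python with the built-in sqlite3 module standing in for PHP and MySQL; the table name, column names and file paths are made up for the example.

```python
import sqlite3


def open_catalog(db_path=":memory:"):
    """Create the (hypothetical) scans table if needed and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS scans (
               id INTEGER PRIMARY KEY,
               image_path TEXT NOT NULL,          -- where the scan lives on disk
               ocr_text TEXT NOT NULL,            -- plain OCR text, no formatting
               keywords TEXT NOT NULL DEFAULT ''  -- comma-separated tags
           )"""
    )
    return conn


def add_scan(conn, image_path, ocr_text, keywords=()):
    """Store one scan: the link to the image, its OCR text and its keywords."""
    conn.execute(
        "INSERT INTO scans (image_path, ocr_text, keywords) VALUES (?, ?, ?)",
        (image_path, ocr_text, ",".join(keywords)),
    )
    conn.commit()


def search(conn, term):
    """Return image paths whose OCR text or keywords contain the term."""
    # LIKE is a plain substring match; % and _ in the term act as wildcards.
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT image_path FROM scans "
        "WHERE ocr_text LIKE ? OR keywords LIKE ? ORDER BY id",
        (pattern, pattern),
    )
    return [row[0] for row in rows]
```

A substring match like this is fine for a small collection; for a large one, a proper full-text index in the database would probably be the better choice.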

Is anybody here doing something similar? How do you do it?

peterZ
Posts: 16
Joined: 16 Jun 2013, 06:13
Number of books owned: 10000
Country: Australia

Post by peterZ » 13 May 2016, 02:26

You could try a Fujitsu ScanSnap iX500 and then some indexing and search software.

BruceG
Posts: 63
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Post by BruceG » 13 May 2016, 04:11

yello,
If you happen to have Acrobat, you could save all the material as PDFs with appropriate naming.
Then use Acrobat to create an index/catalog of all the material; this can be added to with each exhibition. Searching is very quick. It picks up any word as long as the document has been OCRed.
I use this method for magazines, some of which have been going for 100+ years. You could do every PDF file you have.
When searching, the document comes up with every page on which the word is mentioned. For magazines I create a file for each year; you could do this for each exhibition.
I trust this makes sense.

Post by yello » 01 Dec 2017, 20:49

Well, I didn't come back for quite a while. My problem is still there. I have a new plan though.

I bought a book scanner (Czur ET16) and am waiting for delivery.

Here is what I think I can do:

• Scan and OCR the catalogs (the ET16 uses ABBYY FineReader)
• Save as readable PDF in a folder

That is where the ET16's job ends...

Then write PHP code that:

• Extracts the text from the PDF
• Puts the text (without formatting) in a MySQL DB
• Links the DB text to that PDF
• Provides a page with fields for entering, editing, etc.
• Has several fields for adding keywords, source, company name, notes, etc.
• Provides a search page

Then I can do a keyword search or a full-text search to find the catalog or page I want and bring up the PDF.
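For the "text without formatting" step, something along these lines might be enough. This is just a sketch in Python rather than PHP, and the function name is made up; it assumes the extracted text arrives as one string with the PDF's line breaks still in it.

```python
import re


def plain_text(extracted):
    """Flatten text extracted from a PDF into one searchable line."""
    # Rejoin words split by a hyphen at a line break: "cata-\nlog" -> "catalog".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", extracted)
    # Collapse any run of whitespace (newlines, tabs, spaces) to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

One known weakness: a genuinely hyphenated word that happens to break at the end of a line (say "full-" / "text") loses its hyphen, so a search for the hyphenated form could miss it.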

How does that sound?

Post by BruceG » 03 Dec 2017, 18:27

What you are trying to do is what Adobe Acrobat does already, although at a great cost. It is the only program I have found that indexes PDF files. Writing such a program is beyond me.
You can get a trial version for 30 days, I think. So once you have scanned all your material, give it a try. As you would not have to index often, someone with Acrobat may do it for you.

I also have an ET16. The new model is better at scanning shiny paper as there are lights lower down; well, that is what I have read.

A PDF editor may also be helpful to make sure certain words (keywords, source, company name, notes, etc.) are spelt right. Searching will otherwise be difficult.

Post by yello » 11 Dec 2017, 20:26

@BruceG
Well, Acrobat costs something like US$100 - per year! If I can do that DIY, I will try that first.

So here is what happened in the last few days:
  • I got the Czur ET16 from China
  • China models work only inside China (doh!) - don't buy in China!
  • I got it working with an older software version
I have the normal model, not the PLUS one with additional lights. It scans OK as long as the paper isn't shiny. My targets are catalogs and brochures, which are very often shiny. So my next step is to build some additional lighting to fix that.

The scans are 'OK', but of course far worse than flatbed scans.

I can't say much about the OCR accuracy, only this much: it's not that relevant for me. I plan to grab the text and store the full text in a MySQL DB. There I can do a full-text search very fast and, if I want, bring up the original scanned readable PDF. I don't plan to search through the searchable PDFs from the PC; that might be slow. MySQL takes a fraction of a second.

In addition to the full text in the DB, I plan to add relevant keywords. For example, if I had a few thousand car catalogs, I would add: BMW, SUV, or something like that.

I plan to do the actual grabbing and storing in MySQL with the pdfparser.org software from GitHub. It works great on some PDFs; however, Czur's ABBYY saves PDFs in format 1.5, which PDFParser can't handle. The easy way out is to bundle that software with GitHub's xthiago/pdf-version-converter and bring the files down to version 1.4.

I have PDFParser and pdf-version-converter already working separately. Putting them together is easy. Writing the PHP for putting the text in MySQL and designing a user interface page is rather easy too.
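The glue between the two tools mostly comes down to deciding which files need converting first. A rough sketch of that check, in Python rather than PHP: it only reads the `%PDF-x.y` header at the start of the file and does not call PDFParser or pdf-version-converter at all. The function names are made up, and the 1.4 limit is simply the version mentioned above.

```python
import re

# A PDF file starts with a header such as "%PDF-1.5".
HEADER_RE = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_version(data):
    """Return (major, minor) from the %PDF-x.y header, or None if absent."""
    # Only look at the first bytes; tolerate a little junk before the header.
    match = HEADER_RE.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def needs_downgrade(data, max_version=(1, 4)):
    """True if the header version is newer than max_version."""
    version = pdf_version(data)
    if version is None:
        raise ValueError("not a PDF: no %PDF-x.y header found")
    return version > max_version


def file_needs_downgrade(path, max_version=(1, 4)):
    """Same check for a file on disk, reading only its first 1024 bytes."""
    with open(path, "rb") as fh:
        return needs_downgrade(fh.read(1024), max_version)
```

Files for which the check returns True would go through the converter before text extraction. Note that the header is only a first approximation: from PDF 1.4 on, a file can also declare a newer version inside the document itself.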

Would I buy the Czur ET16 again? Mhm, certainly not in China. I might go for the PLUS model next time. The shiny paper issue makes the normal model unusable on glossy paper.

Post by BruceG » 12 Dec 2017, 04:33

Yes, Acrobat costs a bit. I use Acrobat 9, which does not have an annual fee. For the new version you can pay one month at a time or yearly; the old one still does what I need.
I index sets of material separately (i.e. magazines, book sets) and all together (as most of what I do has the same subject), and usually search on people and locations. It is fast, and the page the word or phrase is on comes up. What took weeks of reading can now be done in seconds. The index is searched, not the PDFs.
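The reason searching an index is so much faster than searching the files can be shown with a small inverted index. This is only an illustration of the general technique in Python, not how Acrobat builds its catalog; the file names and page texts are invented.

```python
import re
from collections import defaultdict


def build_index(documents):
    """documents: {filename: [page1_text, page2_text, ...]}.

    Returns {word: set of (filename, page_number)}, pages numbered from 1.
    """
    index = defaultdict(set)
    for filename, pages in documents.items():
        for page_number, text in enumerate(pages, start=1):
            for word in re.findall(r"\w+", text.lower()):
                index[word].add((filename, page_number))
    return index


def lookup(index, word):
    """Return sorted (filename, page) hits for one word, without rereading any text."""
    return sorted(index.get(word.lower(), ()))


docs = {
    "magazine_1901.pdf": ["Mr Smith visited Ballarat", "Weather report"],
    "magazine_1902.pdf": ["Ballarat show results"],
}
index = build_index(docs)
hits = lookup(index, "Ballarat")
```

The slow part (reading every page) happens once when the index is built; each search afterwards is a single dictionary lookup that returns the file and page directly.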

As for the ET16, I got mine from China. To use the up-to-date software, you need the up-to-date firmware, or it will not work.
Most of my material is done on a flatbed scanner. Recently I did some minute books from the 1800s that were falling apart; the ET16 was good for that because further damage was limited.

Shiny paper is difficult. Turning the lights off and using another light source lower down is my suggestion, or light through a window.

cday
Posts: 216
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Post by cday » Yesterday, 15:33

BruceG wrote (12 Dec 2017, 04:33):
"Shiny paper is difficult. Turning the lights off and using another light source lower down is my suggestion, or light through a window."

Could a polarization filter, or possibly two filter sheets at 90 degrees with one in front of the light source, provide a solution?

For example: Polarizing filter (photography)
