
OCR of printed tables

RMH
Posts: 3
Joined: 15 Feb 2015, 13:06
Number of books owned: 0
Country: USA

OCR of printed tables

Post by RMH » 16 Feb 2015, 17:09

I have been searching for a good OCR engine that can recognize printed tables without grid lines. I have used Adobe Acrobat and IrfanView's KADMOS plugin, but both have trouble reading all the way across rows.

My project is building an Access database of 1950s and 1960s outboard boat models using data gleaned from two series of vintage trade-in guides. I destructively scanned a few trade-in guides and am using these as my data source. These include Sheler's Outboard Boat Price Pilot Red Book, published annually from circa 1958 to the mid-1960s, when it was acquired by its competitor, ABOS Publishing's Outboard Boat Trade-In Guide Blue Book. There were hundreds of small manufacturers of wood, aluminum, and fiberglass motorboats in this time period, and there is a growing interest in collecting, restoring, and recreating boats from this era.

I am building the definitive reference guide to these old boats using the model information available plus images drawn from original sales literature, modern photographs, and other sources. This will eventually be published on the web as a searchable, query-driven research guide for public use. It will probably be housed at FiberGlassics.com, the largest online community of vintage boating enthusiasts, where I am an editor of the wiki-style library section.
Attachments
Shelers_60_NEW_Page_08.jpg
1963_Bluebook_A_H_Page_76.jpg

cday
Posts: 227
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: OCR of printed tables

Post by cday » 16 Feb 2015, 19:40

RMH wrote: I have been searching for a good OCR engine that can recognize printed tables without grid lines. I have used Adobe Acrobat and IrfanView's KADMOS plugin, but both have trouble reading all the way across rows.

My project is building an Access database...
The answer is almost certainly the Abbyy FineReader OCR program, outputting to spreadsheets or possibly word processor tables as an initial step (Nuance OmniPage is a possible alternative, but its user interface is widely considered to be inferior).

Serious OCR programs are becoming increasingly complex so you should expect to have to invest some time in learning how to get the best out of whichever program you try. You should also expect to have to proofread the output carefully and where necessary to edit it, either in the OCR program or in the application to which you output.

It is not an output mode I'm very familiar with, but I've done a quick test outputting to Excel using two of the available output options, with no optimisation or proofreading, to give you some idea of what you might expect from Abbyy FineReader. Although your original scans were at 300 dpi, which seems a reasonable resolution, the program suggested using a higher resolution to obtain better results. There are options for selecting the zones to be read, which I haven't used and which might provide output better suited to your needs.
Shelers_60_NEW_Page_08_Left_Formatted_text.xls
(27 KiB) Downloaded 247 times
Shelers_60_NEW_Page_08_Left_Editable_copy.xls
(27 KiB) Downloaded 211 times
I should emphasise that this was only a quick test using modes I am not familiar with, but maybe someone else can add some useful input... ;)

RMH
Posts: 3
Joined: 15 Feb 2015, 13:06
Number of books owned: 0
Country: USA

Re: OCR of printed tables

Post by RMH » 16 Feb 2015, 20:37

Thank you. I was curious about the performance of ABBYY FineReader before spending $120 on a license!

Mostly I have been importing text files into Microsoft Excel using the built-in text import tool. It can split columns on a special character or on spaces, which isn't always optimal, so a lot of editing is necessary. Obviously I would like to speed up the process and cut the time spent on editing. Previously I imported column by column, as that's how the OCR engines I have used recognized the text. Going straight to Excel will be a considerable time-saver.
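For anyone taking the same text-file route, here is a minimal sketch of the delimiter problem described above. It assumes the OCR text separates columns with runs of two or more spaces while single spaces occur inside a cell; that rule, the function names, and the sample rows are all made up for illustration and are not taken from the trade-in guides or from any OCR program's output.

```python
import csv
import io
import re

# Assumption: a gap of two or more whitespace characters marks a column
# boundary, while a single space belongs to the cell (e.g. a model name).
COLUMN_GAP = re.compile(r"\s{2,}")


def ocr_lines_to_rows(text):
    """Split each non-empty line of OCR text into a list of cell strings."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(COLUMN_GAP.split(line))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text, which Excel or Access can import."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample, invented for this sketch.
sample = """
Model           Length  Price
Example Boat A  14 ft   395
Example Boat B  16 ft   525
"""

rows = ocr_lines_to_rows(sample)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Splitting on wide gaps keeps multi-word cells such as a model name together, which a plain single-space delimiter would break apart. It will still fail wherever the OCR collapses a column gap to one space, so proofreading remains necessary.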
Attachments
Capture1.JPG
Capture.JPG

BruceG
Posts: 67
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: OCR of printed tables

Post by BruceG » 18 Feb 2015, 01:15

I use OmniPage for OCR. I have done tables in text pages before, but never just tables. OmniPage's output spread the material over 4 sheets. I do not know why, as I have never set the output to Excel before.
I also output to PDF.
Excel
Shelers 60 New Page 08 OmniPage.xlsx
(12.63 KiB) Downloaded 225 times
PDF
Shelers 60 New Page 08 OmniPage pdf.pdf
(84.07 KiB) Downloaded 255 times
I use PDF because I use Acrobat to index a large number of files, so they can all be searched in a second or two with Adobe Reader.

cday
Posts: 227
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: OCR of printed tables

Post by cday » 18 Feb 2015, 12:32

Nuance OmniPage was mentioned in my first post as a possible alternative to Abbyy FineReader, and although FineReader 12 is my usual OCR program, I also have OmniPage 18 which still seems to be the current version, although I thought a newer version with possibly only minimal changes was launched a while back.
BruceG wrote: OmniPage's output spread the material over 4 sheets. I do not know why, as I have never set the output to Excel before.
Delving into the OmniPage settings, this seems to be the relevant setting:

Tools > Saving Preferences... > Text Converters > Excel > Formatting level

Changing 'Formatted Text' to 'Spreadsheet' produces a single sheet output for my single-page test file:
Shelers_60_NEW_Page_08_Left_OP18_Spreadsheet.xls
(18.5 KiB) Downloaded 216 times
OmniPage processed the test file quickly and without complaint at the original 300 dpi image resolution, producing output that seems as good as, or actually slightly better than, the FineReader output above in terms of recognition accuracy and the preservation of formatting.

The general preference for Abbyy FineReader among reviewers and users probably relates mainly to its better user interface, although the addition of new features in recent versions has inevitably somewhat increased the complexity of the interface. But OCR is a technically demanding application, and obtaining the best from either program requires careful study and a certain amount of experimentation.

Based on the above tests, it may be that OmniPage is marginally the better program for processing tables, although further tests would be needed to confirm that, and the user interface may present more of a challenge for many users.
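One way such further tests could be made less subjective is to score each program's output against a hand-proofread copy of the same page. The sketch below uses Python's standard difflib for a rough character-level similarity ratio. The engine labels and sample strings are invented for illustration and are not the actual FineReader or OmniPage results from this thread.

```python
from difflib import SequenceMatcher


def similarity(ocr_text, reference_text):
    """Return a 0..1 similarity ratio between OCR output and a proofread reference."""
    # autojunk=False stops difflib from ignoring frequent characters
    # (such as spaces or digits) in longer table text.
    matcher = SequenceMatcher(None, ocr_text, reference_text, autojunk=False)
    return matcher.ratio()


# Hypothetical data: one proofread row and two imagined OCR readings of it.
reference = "Example Boat A  14 ft  395"
candidates = {
    "engine_1": "Example Boat A  14 ft  395",
    "engine_2": "Exarnple Boat A  l4 ft  395",
}

scores = {name: similarity(text, reference) for name, text in candidates.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("closest to the reference:", best)
```

The ratio measures only character similarity, so it says nothing about whether cells ended up in the right columns. Comparing exported spreadsheets cell by cell would be a stricter test.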

Edit

Some PCMag.com reviews by different authors:

OmniPage Professional 18 by M. David Stone:

http://www.pcmag.com/article2/0%2c2817% ... 3%2c00.asp

Abbyy FineReader 12 Professional by Edward Mendelson:

http://www.pcmag.com/article2/0,2817,2468186,00.asp

OmniPage Ultimate (a newer, higher-priced version of OmniPage) by M. David Stone:

http://www.pcmag.com/article2/0,2817,2422927,00.asp

RMH
Posts: 3
Joined: 15 Feb 2015, 13:06
Number of books owned: 0
Country: USA

Re: OCR of printed tables

Post by RMH » 19 Feb 2015, 00:52

Thanks for the detailed comparison. Acrobat just won't cut the mustard in recognizing these tables. The PDF created with OmniPage exported very well into Excel using Acrobat, so the fault must lie with Acrobat's OCR, which fails to recognize the table and apply the appropriate formatting to the stored text data.

BruceG
Posts: 67
Joined: 14 May 2014, 23:17
Number of books owned: 500
Country: Australia

Re: OCR of printed tables

Post by BruceG » 20 Feb 2015, 02:11

OmniPage can usually be found around the $100 mark.
Acrobat does a reasonable job with new books. Its only advantage is that it keeps the original scan/photo and creates a text layer on top, so the result has the appearance and accuracy of the original. Copy and paste the text into a word processor to get a more accurate picture of the recognition quality. Acrobat does not let you fix errors the way OCR programs or a PDF editor do.

myfreeocr
Posts: 4
Joined: 02 Apr 2015, 06:26
Number of books owned: 0
Country: India

Re: OCR of printed tables

Post by myfreeocr » 07 Apr 2015, 09:46

Another way to achieve this is to convert the images into PDF files using any online tool, open the PDF file in Microsoft Reader (which comes built into Windows 8), select the text you want, and copy and paste it. It is as simple as that, and it will save you the trouble of buying a software license and dealing with compatibility or finding suitable settings.

cday
Posts: 227
Joined: 19 Mar 2013, 14:55
Number of books owned: 0
Country: UK

Re: OCR of printed tables

Post by cday » 07 Apr 2015, 11:48

myfreeocr wrote: Another way to achieve this is to convert the images into PDF files using any online tool, open the PDF file in Microsoft Reader (which comes built into Windows 8), select the text you want, and copy and paste it. It is as simple as that, and it will save you the trouble of buying a software license and dealing with compatibility or finding suitable settings.
The first post in this thread contains two example table images. Would you care to download them and then post the Excel charts obtained using the process you suggest?

L.Willms
Posts: 130
Joined: 21 Sep 2016, 10:51
E-book readers owned: Tolino Shine
Country: Germany
Location: Frankfurt/Main, Germany

Re: OCR of printed tables

Post by L.Willms » 23 Apr 2018, 01:13

While this is old stuff, and the original poster has probably built his database, I want to add for others consulting this thread that in ABBYY FineReader one can define and store Area Templates, thus defining the whole page or parts of it as Table, Text, or Image, and as page header, footnote, etc. Whether this is useful when the page consists of several tables that have to be analysed individually is another question. It took me some time (years) to discover this feature.
