
Open Source OCR, an ABBY alternative?

E^3
Posts: 41
Joined: 12 Jul 2010, 21:06

Re: Open Source OCR, an ABBY alternative?

Post by E^3 » 17 Aug 2010, 21:33

Hi folks,

From my point of view, there are some definite answers to it.

1) An author is willing to share it for free.
2) More heads/developers contributing to a project is better than fewer, or a single developer, for achieving such goals.
3) One reason might be the potential to be outsourced, sponsored, or funded by donations from large software foundations or commercial software companies.
4) Being an open source sympathizer helps break the barrier of digital monopoly (which sometimes affects the development of a third-world company).

It is so good to have this "DIY Book Scanning" forum; it helps a lot.



Thanks

E^3
Philippines

BookReader
Posts: 3
Joined: 04 Mar 2014, 00:52

Re: Open Source OCR, an ABBY alternative?

Post by BookReader » 19 Aug 2010, 17:49

There are many reasons why the use and promotion of open source software is good. For instance:

1. not reinventing the wheel
2. creating software for societal good that is usable by all
3. not a "black box": if something breaks, it can be fixed, unlike proprietary software, where one generally cannot fix the problem or even get a glimpse of what could be going wrong in the first place.

There are others, but I'll stop with the above. The big question that remains about OSS is whether it can be an economic vehicle for individual, national, or Earthwide (maybe this is going too far) prosperity. As noted above, the basic currency in this model changes from products to services provided. For example, instead of offering a product (Abbyy's professional OCR suite, for instance), the service provider offers a service (training Ocropus or Tesseract, for instance) to do what the client desires. Much of the software available today is highly complex, so specialization is required, which implies there is no obvious reason why this model is bound to fail. What this model needs in order to succeed is for the public to understand that incredibly powerful and useful OSS exists and what it can do. This has already been accomplished: Ubuntu, Android, OOo, MySQL, etc. I believe the notion of a services-based software movement could quickly turn into a viable and healthy (healthy for many reasons) worldwide market in the near future.

Sorry for further turning this thread into an OSS debate and less of a "these are the tools available for open source OCR" thread, but I think this dialogue is necessary. I just built the latest Tesseract, 3.0.0, and it is quite good, though, as others have stated before, it is only a single tool and others are needed, since it is only capable of single-column OCR. I've almost built the latest Ocropus and am very excited to see what capabilities it has. All of these can be trained, of course, so I'm not sure first attempts really give the best picture of the software's potential.

Tim

Re: Open Source OCR, an ABBY alternative?

Post by Tim » 19 Aug 2010, 22:31

BookReader wrote: I just built the latest Tesseract, 3.0.0, and it is quite good, though, as others have stated before, it is only a single tool and others are needed, since it is only capable of single-column OCR. I've almost built the latest Ocropus and am very excited to see what capabilities it has. All of these can be trained, of course, so I'm not sure first attempts really give the best picture of the software's potential.
Sort of good news. The development version of Tesseract added some layout analysis code, so you may want to try that out. The problem is that it doesn't seem to do a very good job on what I've given it so far. I think it's biased towards treating things as columns, but the text I gave it really should have been treated as rows. The good part is that two-column text should be a breeze now. Ocropus has a lot of promise, but it seems to have stalled a bit. Still, we seem to have three viable, progressing projects in Tesseract, Ocropus, and Cuneiform, so that's a good thing. Too bad it doesn't seem easy to share code between them, as they all seem to be working on the same problems separately. I think the state of the art would probably be a little further ahead if the relatively few coders with the skills to work on OCR software and algorithms joined forces a bit more, but maybe not.
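
For anyone wondering how a page can be cut into columns before handing each piece to a single-column engine, here is a minimal sketch of one common approach, a vertical projection profile. This is not Tesseract's actual layout analysis code; the function name `split_columns` and the `min_gap` parameter are made up for illustration, and the page is assumed to be already binarized into rows of 0 (background) and 1 (ink).

```python
from typing import List, Tuple


def split_columns(page: List[List[int]], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges (end exclusive) of text columns.

    `page` is a binarized image: a list of rows, 1 = ink, 0 = background.
    `min_gap` (hypothetical parameter) is the number of consecutive empty
    pixel columns required before a gap counts as a column separator.
    """
    if not page:
        return []
    width = len(page[0])
    # Vertical projection profile: ink pixel count for each pixel column.
    profile = [sum(row[x] for row in page) for x in range(width)]

    columns = []
    start = None  # x where the current text column began, or None
    gap = 0       # length of the current run of empty pixel columns
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The text column ended where this empty run began.
                columns.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Page ended inside a column; trim any trailing empty pixels.
        columns.append((start, width - gap))
    return columns


if __name__ == "__main__":
    # Two blocks of ink separated by four empty pixel columns.
    tiny_page = [
        [1, 1, 1, 0, 0, 0, 0, 1, 1, 1],
        [1, 0, 1, 0, 0, 0, 0, 1, 0, 1],
    ]
    print(split_columns(tiny_page))
```

Each returned range could then be cropped out and passed to the OCR engine separately. Real scans are noisier than this, so a threshold on the ink count and a much larger `min_gap` would probably be needed, and this approach will wrongly split text that should be read as rows.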
