Proofreading OCR text (text under the image)

Moderator: peterZ

seasalt

Post by seasalt »

Hello, does anyone know of a trick or method for easy proofing of OCR text, possibly with ImageMagick or some other image manipulation tool? The Adobe Acrobat X (Mac) option to review OCR suspects one by one is slow.

This is my manual process for setting up the proofreading (I am looking for a tool to automate it):

Page by Page process

1 Make the PDF page double width (so I can view the "text" and the "image of text" side by side).
(Note: none of the OCR engines I've tested actually use Acrobat's "Layer" feature (e.g. a text layer and an image layer); rather, both text and image are in the same single layer.)

2 Using the Adobe Acrobat X plugin Enfocus PitStop, select the page contents and change FILL to ON (the text is invisible/hidden/transparent because the font fill is OFF).
(This gives the glassy two-layer look, but they are not really layers.)

3 Unselect the contents, then move the image/bitmap to the right, next to the text.
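
Steps 1 and 2 above can be sketched in a few lines of Python. This is only a rough illustration, not what PitStop does internally. It assumes the hidden OCR text is written with PDF text rendering mode 3 (the `3 Tr` operator, which paints text with neither fill nor stroke), which is common for searchable-image output but not guaranteed for every engine. It also assumes the page content stream has already been decompressed into a plain string by some PDF library; reading and writing the PDF itself is not shown. The names widen_box and show_hidden_text are made up for this example.

```python
import re

def widen_box(media_box):
    """Return a page box twice as wide, plus the x-offset for moving the image.

    media_box is (x0, y0, x1, y1) in PDF points.
    """
    x0, y0, x1, y1 = media_box
    width = x1 - x0
    # The extra width is added on the right; shifting the scanned image
    # right by `width` places it next to the text (step 3).
    return (x0, y0, x1 + width, y1), width

# Matches a "3 Tr" operator (assumed to mark the invisible OCR text).
# The lookbehind avoids touching operands such as "13 Tr" or "0.3 Tr".
_INVISIBLE_TR = re.compile(r"(?<![\w.])3(\s+)Tr\b")

def show_hidden_text(content):
    """Switch text rendering mode 3 (invisible) to 0 (fill) in a decoded stream."""
    return _INVISIBLE_TR.sub(r"0\g<1>Tr", content)
```

For a US Letter page, widen_box((0, 0, 612, 792)) gives a 1224 x 792 box and an image offset of 612 points. The regex is a plain text substitution, so it could also hit a literal "3 Tr" inside a string or inline image; a real tool should parse the content stream properly.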

Then I edit the OCR'd text as follows:
1 Run the spellchecker
2 Remove scannos (common scanning errors, e.g. [ for J, 3 for S, etc.)
3 Fix text that was garbled by my underlining in the book
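
The scanno pass in step 2 of the editing list could be partly automated. The sketch below is a minimal example, assuming the text has already been extracted to a plain string. The table holds only the two substitutions mentioned above ([ for J, 3 for S); SCANNOS, find_scannos and fix_scannos are hypothetical names, and a real list would be longer and tuned to the book.

```python
import re

# Illustrative scanno table; extend it with errors seen in your own scans.
SCANNOS = {"[": "J", "3": "S"}

# A suspect is a scanno character that touches a letter on at least one side,
# so a standalone digit such as "3 cats" is left alone.
_SUSPECT = re.compile(r"(?<=[A-Za-z])[\[3]|[\[3](?=[A-Za-z])")

def find_scannos(text):
    """Return (offset, found_char, suggested_char) for each suspect character."""
    return [(m.start(), m.group(), SCANNOS[m.group()])
            for m in _SUSPECT.finditer(text)]

def fix_scannos(text):
    """Replace every suspect character with its suggested letter."""
    return _SUSPECT.sub(lambda m: SCANNOS[m.group()], text)
```

Reviewing the output of find_scannos is safer than calling fix_scannos blindly, because a rule this simple also fires on legitimate text such as "3rd".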

NOTE: this method is for the Searchable Image (Exact) option in the Adobe Acrobat X (Mac) OCR engine, or for ABBYY or Readiris.
It does not work for the AAX ClearScan option, as ClearScan is a different OCR method (it creates a new Type 3 font, so editing is crazy, crazy territory; fonts in PDF land are not markup languages, they are blobs on a page).

4 Then I either delete the image or return the page to as-is (i.e. undo steps 3 and 2).
5 Return the page to its original size.
Digitizer
Posts: 9
Joined: 18 Jan 2011, 11:58

Re: Proofreading OCR text (text under the image)

Post by Digitizer »

Hi,

My first thought was: have a closer look at fire-text.

I saw that Firefox plugin a while ago, and perhaps it is partly what you are looking for. If I understood the description correctly, it lets you keep (OCR'd) text and (scanned) image files in two directories, load both into your browser, and then edit the text files while comparing text and image.
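
The two-directory idea can be sketched with the Python standard library alone. This is not the plugin's code, only an illustration of the general approach, and it assumes each text file and its page image share a file name stem (e.g. p001.txt and p001.png). The names pair_pages and proof_page_html are made up for this example.

```python
import html
from pathlib import Path

def pair_pages(text_dir, image_dir, image_exts=(".png", ".jpg", ".jpeg")):
    """Pair each .txt file with the image that shares its file stem."""
    images = {p.stem: p for p in sorted(Path(image_dir).iterdir())
              if p.suffix.lower() in image_exts}
    # Text files without a matching image are skipped.
    return [(t, images[t.stem])
            for t in sorted(Path(text_dir).glob("*.txt"))
            if t.stem in images]

def proof_page_html(text_path, image_path):
    """Build a minimal HTML page with the scan on the left and the text on the right."""
    text = html.escape(Path(text_path).read_text(encoding="utf-8"))
    src = html.escape(Path(image_path).resolve().as_uri())
    return (
        '<!doctype html><html><body style="display:flex;gap:1em">'
        f'<img src="{src}" style="width:50%">'
        f'<textarea style="width:50%;height:95vh">{text}</textarea>'
        "</body></html>"
    )
```

Writing the returned string to an .html file and opening it in a browser gives a side-by-side view. Edits made in the textarea are not saved back to disk in this sketch.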

Cheers,
Marcus.
seasalt

Re: Proofreading OCR text (text under the image)

Post by seasalt »

Thank you, Marcus.

The text and image are in the same file (PDF).
The text is underneath the image.

I use PDF to retain all the formatting, and ultimately I want a PDF to read my book.
---
I will check the link out, as I do use Firefox. Thank you.