html conversion: "<b />" vs. "<p></p>"

Posts: 82
Joined: 06 Jan 2011, 00:55

Post by jimboh » 18 Dec 2012, 18:43

I use a Mac. I have Adobe Acrobat X Pro. I need to go from a PDF file to an HTML file in preparation for the final editing to create an ebook using Calibre (i.e. import the HTML file into Calibre and use Calibre to convert to the desired ebook format).

In short, I need an easy way to get to an HTML file with each paragraph marked off with <p> at the beginning and </p> at the end. No matter what I do with Acrobat and Word for Mac, I get an HTML file with paragraphs created solely by ending blocks of text with <b />. Nothing at the beginning of a paragraph; just <b /> at the end. I see no easy way to convert the latter HTML structure to the former. I lack the energy to change the code manually for each and every paragraph in, say, a 300-page book.

For the Mac, does anyone know of software that can create an HTML file from an OCRed PDF (or a PDF saved as a Word file) with each paragraph delineated by <p>...</p> and not ...<b />?

Please, no solutions that treat every line of a paragraph as a separate paragraph!

P.S. I can handle basic find-and-replace with TextWrangler, BlueGriffon, or similar editing software. Regular expressions and Unix-like scripting are beyond my abilities. I have Clean Text from the App Store, but the absence of help for all the commands makes me nervous about using it.

Posts: 18
Joined: 22 Dec 2011, 20:00
E-book readers owned: kindle
Number of books owned: 4000
Location: Nr. London, UK

Re: html conversion: "<b />" vs. "<p></p>"

Post by stearn » 22 Jan 2013, 20:35

Only just getting to grips with Acrobat X Pro myself, but if you are batch OCRing via an action, you can change the output options in the "Save to" section: click "Export File(s) to alternate format" and choose HTML from the dropdown. I don't know what markup you will get, but it has to be a start, since simple find/replace can always be used to remove tags you don't want.

Posts: 596
Joined: 06 Jun 2009, 23:57

Re: html conversion: "<b />" vs. "<p></p>"

Post by spamsickle » 30 May 2013, 08:05

Sorry for the late reply, but I'm just starting to be interested in OCR.

It would seem easy enough to write a script to replace every instance of <b /> with </p><p>, or even to do a "replace all" in a text editor. Then all you have to do is add a <p> at the front of the text and strip the leftover <p> from the end.
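
For anyone who would rather not do the replace by hand, here is a minimal Python sketch of the same idea. It is only an illustration under some assumptions: the function name breaks_to_paragraphs is made up for this post, the end-of-paragraph marker is taken to be the <b /> tag described above (with or without the space), and the input is assumed to be just the body text, not the whole file with its <html> and <head> wrappers. Instead of replacing and then patching the two ends, it splits the text at each marker and wraps every non-empty chunk in <p>...</p>, which comes to the same result.

```python
import re

# Assumption: paragraphs end with the "<b />" tag described in the thread,
# possibly written without the space ("<b/>").
MARKER = re.compile(r"<b\s*/>")


def breaks_to_paragraphs(body: str) -> str:
    """Wrap each marker-terminated block of text in <p>...</p>."""
    chunks = MARKER.split(body)
    # Collapse line breaks and runs of spaces inside each paragraph so that
    # every line of a paragraph is NOT treated as its own paragraph.
    paragraphs = [" ".join(chunk.split()) for chunk in chunks]
    # Skip empty chunks, e.g. the one left over after the final marker.
    return "\n".join(f"<p>{p}</p>" for p in paragraphs if p)


# Small made-up sample to show the shape of the input and output.
sample = "First paragraph,\nsecond line.<b />\nNext paragraph.<b/>\n"
print(breaks_to_paragraphs(sample))
```

To run it on a real file, the text could be read with pathlib.Path("book.html").read_text() and the result written back with write_text(); "book.html" is a placeholder name. If the export turns out to use a different marker, only the MARKER pattern should need changing.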
