
Paragraph issues

sassanik
Posts: 12
Joined: 04 Mar 2014, 00:53

Post by sassanik » 20 Jan 2011, 10:08

Okay, so I got my images scanned, ran them through Abbyy FineReader, and got a fancy new Word document.

One of the issues I am having is that when I change the font the paragraphs don't resize correctly.

i.e., it turns out something like this:


veloped a simple boms treatment which has
brought new bone, health and happiness to
many thousands. Many who had beeo child-
less for years became proud and happy
Mothers. Husbands have written mo the


when it should be like this:

veloped a simple boms treatment which has brought new bone, health and happiness to many thousands. Many who had beeo childless for years became proud and happy Mothers. Husbands have written mo the

I am sure this problem has a fancy name, but I have no clue what to call it.

Is there an efficient way to fix this? A macro or something? I normally use Word but also have Open Office installed.

Thanks!

Sassanik

Anonymous1

Re: Paragraph issues

Post by Anonymous1 » 20 Jan 2011, 11:26

What you're looking for is Word Wrapping. Like this website; when you resize your browser window, the text shifts position to accommodate.

If you would be okay with making an HTML book, then that's easy to do. I know DjVu doesn't support word wrapping, and neither does PDF (I think). If not, I'm out of ideas.

reggilbert
Posts: 49
Joined: 28 Sep 2010, 19:57
Number of books owned: 3000
Location: Buffalo, New York

Re: Paragraph issues

Post by reggilbert » 20 Jan 2011, 11:51

Presumably this is the Abbyy program translating the end of each line on the page as a "hard return," that is, the same as the end of a paragraph, rather than the "soft return" that word processors use to wrap lines that have reached the margin. Soft returns automatically move as the margin or font size changes the end of the line. Hard returns stay, and maintain the appearance of the end of a paragraph, no matter what editing or font-size changing takes place.

So you probably have an invisible "hard return" character at the end of all the lines in your document. The good news is that you can probably search and replace for this character. The bad news is that, if the character is the same for ends of paragraphs as it is for the ends of lines within paragraphs, it would make it impossible to do a global search and replace, as all your paragraphs would run together.

Two possible solutions.

One might be in Abbyy FineReader. Surely after all these years the program has the capability of distinguishing between paragraph endings and simple wrap endings. See if there might be an option you can set that will enable this capability. If there is no such obvious setting, the option might be switched on automatically by choosing an output format that implies it. For example, see whether you can tell the program to output an RTF file ("rich text format," a nearly universal format that carries many of the formatting options missing from TXT files, which hold just text and line returns). Such an output choice may switch on FineReader's wrap recognition, if it has one.

Even the file you currently have may in fact distinguish between paragraph and line endings, but not put in soft returns because there is no universal character for them (I am speculating on that). In MS Word, these can be searched for and replaced using ^p and ^l, the two codes for the special characters that mark the end of a paragraph and the end of a line. If Open Office can search for and replace special characters, it may use different codes for paragraph and line endings. If your output file, or the output from some other FineReader output choice, does make this distinction, you simply search for the line ending character and replace it with a space.

Another solution is less elegant, but could work even if FineReader cannot distinguish between ending types. If the beginnings of the paragraphs in your file have a unique characteristic, like a tab or five spaces, you can globally search for the combination of that characteristic and the paragraph ending before it, and replace it with any unique phrase, like "yyy." So, in Word, if the output file represents the beginning of a paragraph with, say, five spaces, the search field would read "^p" followed by five spaces. The only problem to watch for is non-paragraphs that look like that, for example, tables. Now your file would have all the paragraphs running together, but separated by this unique phrase. Then you would search for all the remaining paragraph endings, which now should be only those objectionable hard returns inside paragraphs, and replace them with a space (unless there is already a space at the end of all those lines). Then you would search for the unique phrase and replace it with something that restores the paragraph appearance, like the return (paragraph ending) character plus the tab character (^p and ^t in Word).
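For anyone who would rather script this outside of Word, here is a minimal Python sketch of the same placeholder trick. It assumes the OCR output has been saved as plain text, that every real paragraph starts with five spaces, and that the marker string never occurs in the book. The function name and the marker are made up for illustration and are not part of FineReader or Word.

```python
import re

# Hypothetical marker; pick anything that cannot appear in the book itself.
MARKER = "@@PARA@@"


def unwrap_indented(text, indent="     "):
    """Join hard-wrapped lines, keeping breaks before indented paragraphs."""
    # Step 1: a line break followed by the indent is a real paragraph start.
    marked = text.replace("\n" + indent, MARKER)
    # Step 2: every remaining line break is a wrap inside a paragraph.
    # Swallow trailing spaces so the join never produces a double space.
    joined = re.sub(r" *\n", " ", marked)
    # Step 3: put the paragraph breaks (and their indent) back.
    return joined.replace(MARKER, "\n" + indent)


sample = "     First line\nsecond line.\n     Next para\nends here."
print(unwrap_indented(sample))
```

As with the Word version, tables or other indented non-paragraph lines would be treated as paragraph starts, so the result still needs a look.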

I see Abbyy Finereader mentioned often on this forum, I imagine quite a few forum members use Open Office as well, and likely this hard-return problem is pretty common too, so let us know if you come up with a solution.

sassanik
Posts: 12
Joined: 04 Mar 2014, 00:53

Re: Paragraph issues

Post by sassanik » 20 Jan 2011, 19:11

So in theory I need some sort of a macro that if ^p then replace with (space) unless ^p^p then (nothing)

It seems to be a common problem, lots of people posting about it, but darned if I found any answers that I actually understand and can implement!

Sassanik

sassanik
Posts: 12
Joined: 04 Mar 2014, 00:53

Re: Paragraph issues

Post by sassanik » 20 Jan 2011, 19:28

okay so I think I figured out how to do this

Run Find Replace-

^p^p replace with ^t

Then run

^p replace with (space)

Then

^t replace with ^p^p

The problem I am having is in step 2: when I replace ^p with a (space), it removes the space entirely and leaves the words run together, i.e. "haddiscovered". If I add two spaces, though, I get "had(space)(space)discovered". *scratches head*
Is there a step I am missing, or one that needs to be added?

Sassanik

Anonymous1

Re: Paragraph issues

Post by Anonymous1 » 20 Jan 2011, 19:39

Hmm, try replacing ^p^p with asdfghgfdsa (or something that will never appear in a book) and replacing that instead of ^t.

Wait, did you select "Use Wildcards"? That might make it work.

sassanik
Posts: 12
Joined: 04 Mar 2014, 00:53

Re: Paragraph issues

Post by sassanik » 20 Jan 2011, 23:48

Okay got it, I think!

Find Replace as follows

^p^p with ^t

^p with ^s

^t with ^p^p

the ^s is a non breaking space, it fixes the double space problem that I was having.
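The same three steps can be sketched in Python for a plain-text export. This is only an illustration under some assumptions: paragraphs are separated by one blank line (the equivalent of ^p^p), the text contains no tab characters, and the function name is invented here. A plain space is used in step 2 rather than a non-breaking one, since the vanishing-space behaviour described above appears to be specific to Word.

```python
def unwrap_blank_line_paragraphs(text, marker="\t"):
    """Three-step unwrap: protect paragraph breaks, join lines, restore."""
    # Step 1 (^p^p -> ^t): hide real paragraph breaks behind a marker.
    text = text.replace("\n\n", marker)
    # Step 2 (^p -> space): remaining line breaks are wraps inside a paragraph.
    text = text.replace("\n", " ")
    # Step 3 (^t -> ^p^p): bring the paragraph breaks back.
    return text.replace(marker, "\n\n")


sample = (
    "veloped a simple boms treatment which has\n"
    "brought new bone, health and happiness to\n"
    "many thousands.\n"
    "\n"
    "Husbands have written mo the"
)
print(unwrap_blank_line_paragraphs(sample))
```

If the book itself uses tabs, pass a different marker that cannot occur in the text.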

Sassanik

Mandor
Posts: 24
Joined: 28 Jul 2009, 01:27
E-book readers owned: lBook V8, lBook V3
Number of books owned: 0
Location: Sofia, Bulgaria

Re: Paragraph issues

Post by Mandor » 21 Jan 2011, 02:23

Yes, there is an option in FR called "Keep line breaks". You must turn it off. But even when this option is on, the character at the end of a line is a "soft return" (also called "end of line"), not an "end of paragraph".
FR recognizes paragraphs very well (not perfectly, but very well), so you don't need to remove false new lines "by hand".
And finally, your algorithm for removing line breaks doesn't handle dashes and optional hyphens at the end of a line (child-less in your example).

sassanik
Posts: 12
Joined: 04 Mar 2014, 00:53

Re: Paragraph issues

Post by sassanik » 21 Jan 2011, 09:31

It is good to know that Fine Reader has that option. Alas I have the cheapie "express" version, which does not allow such advanced settings.

I macro'd my little find replace series, and it seems to work well so far.

It is true that my algorithm does not handle dashes, I suppose I could add in something that is like

Find - replace with (space)

But I would worry that would get rid of my em dashes.
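One way around the em dash worry, sketched in Python on a plain-text copy: only touch a hyphen that sits directly before a line break with a letter on each side. Em dashes and hyphens in the middle of a line are left alone. The function name is invented for this example, and the step has to run before the line breaks are replaced with spaces. The known weakness is that a real compound that happens to be split at the end of a line (such as "well-" / "known") loses its hyphen, so the result still needs proofreading.

```python
import re


def join_line_end_hyphens(text):
    """Rejoin words that were hyphenated across a line break."""
    # Match letter, hyphen, line break, letter; keep only the two letters.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


sample = "Many who had beeo child-\nless for years became proud and happy"
print(join_line_end_hyphens(sample))
```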

This new macro will at least cut down the time I need to spend by a pretty significant amount I think.

Most of the documents I am scanning have to be double checked anyway, due to poor editing to start with in addition to scanning errors that occur during the OCR process.

Thank you guys for offering suggestions and help!

Sassanik
