Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Pelle
Posts: 13
Joined: 12 Apr 2012, 13:40
E-book readers owned: Asus Transformer Prime
Number of books owned: 14
Country: Sweden

Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Post by Pelle »

As stated in the topic, I have big problems: after editing the pictures in Scan Tailor, I want to make all the text on the pages searchable in a PDF and also make the background white (the book's background is slightly yellow).

This is a simple question about which app is best for Swedish words. What apps and (free) dictionaries do you use for the Swedish language in books (OCR)?

Very fast answers are appreciated because I have a university exam very soon, and I want to be able to read lots and lots on my pad instead of always carrying x kg of books...

Best regards, :oops:
Heelgrasper
Posts: 70
Joined: 19 Feb 2012, 21:04
E-book readers owned: None
Number of books owned: 500
Location: Randers, Denmark

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Post by Heelgrasper »

I've used Tesseract and PDFbeads (via Homer, http://bookscanner.pbworks.com/w/page/4 ... 20software) on Danish texts with what looks like good results, so that might work for Swedish too. The OCR quality doesn't matter for reading; it only affects how likely it is that a search will find all the right stuff, or how well copy-pasting something from the book works. As such it's just an added feature compared to the printed book.

Making the background white would (as far as I know) be something to do in ScanTailor by setting the output to bitonal (b/w). Or mixed if there are illustrations.
---
Jakob Øhlenschlæger
Randers, Denmark

The past is a foreign country: they do things differently there
L. P. Hartley
Pelle
Posts: 13
Joined: 12 Apr 2012, 13:40
E-book readers owned: Asus Transformer Prime
Number of books owned: 14
Country: Sweden

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Post by Pelle »

I just did a quick try with Tesseract and the VietOCR GUI, with both Swedish and Swedish "Fraktur" (what Fraktur is, I don't know...). Anyway, it didn't go very well. Maybe it is because the page is a little bit tilted, do you think?

I found the thingy in Scan Tailor you mentioned (to get the background more white-ish) and it worked fine. It was called "Equalize illumination" and was found on the last processing page (a.k.a. where you are about to create the actual TIF files), so thanks a lot for that.

If anyone has a good (great) app for getting these Swedish medical books correctly OCR'ed, PLEASE write here.

Best regards,
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Post by daniel_reetz »

Hi pelle, I understand English may be your second language, so communication might be a little tough. Generally we don't care about perfect English here but a few things will help you get a better answer.

1. Don't rush people - everyone here already helps as fast as they can.
2. Please post an example page and a clear description of your problem. For example you say the page is a little bit tilted - well, we can't see that -- so we can't help.
3. Please post about your computer. Windows, Linux, Mac? What have you tried that DID work? Also tell us what you've tried that DIDN'T work, so we can figure out what went wrong.

Heelgrasper is very knowledgeable and very well practiced with difficult texts so please heed his advice and spend some time with that software and that approach.

Thanks,
The Management ;)
Pelle
Posts: 13
Joined: 12 Apr 2012, 13:40
E-book readers owned: Asus Transformer Prime
Number of books owned: 14
Country: Sweden

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Post by Pelle »

Hello Daniel.

I'm sorry for the rush. I didn't mean to be disrespectful in any way. I just want to get past this last exam and start my summer work without having to redo the exam in August :p
I have now uploaded two versions of the exact same picture to TinyPic for you to look at: one is called "whitish.tif" and the other, "darker" one is called "normal.tif". As mentioned above, I edited it as Heelgrasper said (to get one of them whiter).

With Tesseract and VietOCR3 (a GUI for Tesseract on Windows) I get this when running OCR on...

Whitish.tif..:

[img]0http://i47.tinypic.com/op1r0y.jpg[/img]

wuswm
rama 1
um.-iw v
»eu nn-|»=a.|«-mlhvp-I-Ann.-||;«||" ~ '
mm» uum=.1=| 1:
v..a.....\.».=mw u
L.|....Mm~m u
|;=.,.1.....,,;.,,... M
M....,.m......,-.« .»
mmm man-mm. 11
m.N.,,....,.fi...,,. <1
o1.».w,,...w...1m.N... .-
o1.»<.«.m..<.1.,..,.1.,...u... W
o,«.“...,...|..,.d|.,.; ..
mm »,.«|.~........m. W
»<mm> uvgm-1=|mmfing 1:
ß<,..,,,,. r..,m....|:..|>|.|.....,.1.m«.,...,1,., L.
A.1.m..§«m,.v|.|.....,.«1| 1.
mm . ms_«.nm»»«»<|...1-p.»4.m..u U

-------------------------------------------------------------------------------------------------------------------

And this on normal.tif:

[img]0http://i49.tinypic.com/1jkzlt.jpg[/img]

|NNEHALL
Förord 7
Inledning 9
KAPITELI Läkemedel 13
vadärenläkemedelï 1;
Laxrrrrrrarlmrrrrr rr
ßeredmrrgrform rr
Adminiszmiornsärr 14
KAMTEL2 Ordination 11
Behërigz an ordinera 17
o1i1r=ryp=r av rrraanrrmrr ra
olika former för ordinazion rg
orairramnshflnalmg 10
Hur en ordirmirm slrrivr 10
KAPHEL3 Läkemedelshantering 23
Begrepp 1 far|ra11.rra= rm lalrrrrrmlslranrrrarrrr 13
Aamarrisrrermg =v1ä|r=rrr=a.| 1,,
KAPITEL 4 FASS - en handbok och en uppslagsverk 27
. mrrrrmrr om rrwrrrrrrrrrmur
J


If you compare it to the picture, it isn't even close to the words in the book/picture. And that is where my problem is.
And yes, you can probably say that my second language is English. Hope there aren't too many misspellings ;(

Best regards,
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Post by daniel_reetz »

No worries. ;) Especially about spelling ;)

It almost seems like the original pictures might be too low-resolution for the OCR engine. How big are the original images in pixels? Can you post a section of a page at the original size?
Pelle
Posts: 13
Joined: 12 Apr 2012, 13:40
E-book readers owned: Asus Transformer Prime
Number of books owned: 14
Country: Sweden

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Post by Pelle »

The original picture was 3888x2592 px, but when I took the pictures I had about 10-12 cm "left out" around the book. In other words, the part of the picture where the pages are is 1176x1764 px (when cropped so only the page(s) are seen and not my floor).
It was taken with an EOS 400D, so I have the possibility to take all pictures in RAW mode, but that seemed unnecessarily big.

Maybe I should just zoom in a bit more, so the pages are like 3888x2592 instead of 1176x1764 when fixed. Also, I remember that Scan Tailor asked me to crop the files, so I wrote 1200x1200 in the Scan Tailor box. Maybe take the picture as 3888x2592, as I (you) said, and then in Scan Tailor state 2000x2000 instead of 1200x1200?

;| :)
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Post by daniel_reetz »

That's exactly right. Fill the image with the page so all your pixels represent the book. Zoom in on that thing! Your OCR will improve VERY quickly. Try it with just one page, you'll see!
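
To put rough numbers on that, here is a small Python sketch using the pixel sizes quoted earlier in the thread (1176x1764 px for the cropped page versus the camera's full 3888x2592 px frame). It assumes the camera is turned so that a single page fills the whole frame, which is an idealisation, and the function name is made up for illustration.

```python
def linear_gain(old_px: int, new_px: int) -> float:
    """Return how many times more pixels per unit of page length the new framing gives."""
    return new_px / old_px


# Cropped page as photographed so far: 1176 x 1764 px.
# Full sensor in portrait orientation (assumed): 2592 x 3888 px.
width_gain = linear_gain(1176, 2592)
height_gain = linear_gain(1764, 3888)

# Total pixel count scales with the product of both sides.
area_gain = width_gain * height_gain

print(f"{width_gain:.2f}x per side, {area_gain:.2f}x total pixels")
```

Since OCR cares about how many pixels each letter gets, the per-side figure is the one that matters: every character would be a bit more than twice as tall and wide in pixels.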
abmartin
Posts: 79
Joined: 15 Sep 2010, 15:33
Number of books owned: 2000
Country: USA
Location: Ohio

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Post by abmartin »

As Dan says, the better the picture, the better your results will be.

When doing OCR, I find a significant increase in quality if I get over the 300 dpi line. I too use scantailor before doing OCR.

One thought: you mention that you were manually entering a value of 1200x1200. If you are doing that at the beginning of the entire process, that is definitely not correct. At the beginning, ScanTailor asks for the number of dots per inch (1 inch = 2.54 cm). To correctly determine the DPI of an image, I like to take a photo with a ruler in it. I can then measure that ruler with GIMP's measuring tool (I expect most image editing software has this capability). If you did enter 1200x1200 at the beginning of the process, you told ScanTailor that the image was less than two inches (~5 cm) wide. That information gets encoded in the final image. Tesseract might then be very confused by the size of the image, trying to read text less than a mm in height.
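
The arithmetic behind this can be sketched in a few lines of Python, in centimetres for the metric-minded. The function names and the ruler and letter measurements are hypothetical, chosen only for illustration; the 1176 px page width is the figure quoted earlier in the thread.

```python
CM_PER_INCH = 2.54


def dpi_from_ruler(pixels_measured: float, ruler_length_cm: float) -> float:
    """Estimate the real DPI from a ruler segment measured in the photo."""
    return pixels_measured / (ruler_length_cm / CM_PER_INCH)


def implied_size_cm(pixels: float, declared_dpi: float) -> float:
    """Physical size that a pixel count corresponds to at a declared DPI."""
    return pixels / declared_dpi * CM_PER_INCH


# Hypothetical measurement: 10 cm of ruler spans 1181 px in the photo.
print("estimated DPI:", round(dpi_from_ruler(1181, 10.0)))

# A 1176 px wide page declared as 1200 DPI is treated as a tiny page.
print("implied page width (cm):", round(implied_size_cm(1176, 1200), 2))

# Hypothetical 30 px tall letter at the same declared DPI, in millimetres.
print("implied letter height (mm):", implied_size_cm(30, 1200) * 10)
```

Under these assumptions the page comes out at roughly 2.5 cm wide and the letters well under a millimetre tall, which is the mismatch described above.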

If you entered that number at the end of the process for the output DPI, I find that unnecessary too. Doing that is asking Scantailor to create pixels. It does make the images look smoother on a screen, but I find that 300 or 600 final DPI gets a better result with Tesseract.


Responding to an earlier question: Tesseract's Swedish Fraktur isn't going to be helpful on that image. Fraktur is an old style of writing that died out in Scandinavia by the early 20th century (the Germans held on a bit longer). The standard Swedish language is what you will want to use, since the book is in a Roman typeface. https://sv.wikipedia.org/wiki/Frakturstil
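
For anyone who prefers to skip the GUI, below is a minimal Python sketch that calls the Tesseract command line with the standard Swedish data (`swe`) and asks for a searchable PDF. It assumes a Tesseract version that ships the `pdf` output config and has the Swedish language data installed; the versions discussed in this thread may predate that, in which case PDFbeads remains the route to a PDF. The function names and file names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_tesseract_command(image: Path, output_base: Path, lang: str = "swe") -> List[str]:
    """Build the argument list for one Tesseract run.

    The trailing "pdf" selects Tesseract's PDF output config, so the
    result is written to <output_base>.pdf with a hidden text layer.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_to_pdf(image: Path, output_base: Path, lang: str = "swe") -> Path:
    """Run Tesseract on one page image and return the path of the PDF."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    return Path(str(output_base) + ".pdf")


# Hypothetical file names, for illustration only.
command = build_tesseract_command(Path("normal.tif"), Path("normal_ocr"))
print(" ".join(command))
```

Calling `ocr_to_pdf(Path("normal.tif"), Path("normal_ocr"))` would then run the same command for real. Keeping the command building separate from the execution makes it easy to check what will be run before pointing it at a whole book.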
Pelle
Posts: 13
Joined: 12 Apr 2012, 13:40
E-book readers owned: Asus Transformer Prime
Number of books owned: 14
Country: Sweden

Re: Need fast help to TiF -> Searchtable PDF (OCR) SWEDISH

Post by Pelle »

Abmartin, let me just say three words: You are wonderful! :p

It worked much better (not 100%, but at least 60-70%) when I didn't enter 1200x1200 at the beginning, when ScanTailor asked me to input the size. Instead I just chose 600 dpi, and I made the page whitish at the end before saving the files.

I don't understand that thing with inches, not even a bit. I don't know any other Scandinavian who knows inches either; we always use millimeter, centimeter, decimeter and meter (10 mm = 1 cm | 10 cm = 1 dm | 10 dm = 1 m) ^^

Maybe this is one to make sticky for Scandinavians? :p


Again, thanks a lot, Heelgrasper, daniel_reetz and abmartin!

:mrgreen: :D

By the way, I was looking for a "solved" button but can't find one ;(