K'ichee' book digitization

Posted: 28 Feb 2011, 14:36
by dvelleman
I'm building a scanner to help digitize some books in K'ichee', one of the modern Mayan languages of Guatemala. Until recently I'd been working with a flatbed scanner, but my patience is starting to run out...

Here's one of the successes --- this is from a history of the Catholic church in Guatemala, scanned on a flatbed, with very nice despeckling and pretty okay deskewing from a recent beta version of Scan Tailor.
500.png
And here's a tougher case. This is from Tzonob'al Tziij, a collection of speeches that used to be given during the preparations for a K'ichee' wedding. That fancy background image is based on a traditional weaving pattern. It's pretty, but man, it confuses the bejeezus out of Scan Tailor. I took this with an overhead camera but no platen. I'm hoping that adding a platen will give me a straighter image and let me skip Scan Tailor altogether (though it may be that OCRing these pages will be just as hopeless).

The three dots in the top right corner are the page number in hieroglyphic numerals. It's a base-20 system: two dots and then one dot means (2x20)+1=41. Nobody does everyday math with the old numerals anymore, but they're catching on for Serious Use in Documents Of Cultural Significance --- rather like Roman numerals in Europe.
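For anyone who wants to check the arithmetic, here is a minimal sketch in Python of reading a page number written this way. It assumes plain base 20 with the digits listed from most significant to least, as in the page number described above; the function name `vigesimal_to_int` is made up for illustration.

```python
def vigesimal_to_int(digits):
    """Convert base-20 digits (most significant first) to an integer."""
    total = 0
    for d in digits:
        if not 0 <= d < 20:
            raise ValueError(f"not a base-20 digit: {d}")
        # Shift what we have so far up one place, then add the new digit.
        total = total * 20 + d
    return total


# Two dots, then one dot: (2 x 20) + 1
print(vigesimal_to_int([2, 1]))  # 41
```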
tzonobal.png
This is the colophon from the same collection of speeches. It's the date when the print run ended, put into the Maya calendar and written out in hieroglyphs. (The last page of the book has the date the print run began, written out the same way.) The calendar is still in occasional use --- not for everyday timekeeping, but in traditional astrology and medicine, which never entirely died out. But really, again, this is more like a Latin inscription in an English book: whether or not you can read it yourself, it gives the whole project an air of gravitas.
colophon.png

Re: K'ichee' book digitization

Posted: 01 Mar 2011, 03:14
by StevePoling
since that background is a repeating pattern, couldn't you do some kind of 2D blind deconvolution to subtract it out?

Re: K'ichee' book digitization

Posted: 01 Mar 2011, 11:26
by daniel_reetz
If you can get sharper images - more like the first page with the hatched pattern - we can help you design a thresholding operation to extract the hatches. The basic formula would be to say "all pixels below X should be black" and "all pixels above X should be white" and go with that. Steve also has a point that it can be treated in frequency space, though I don't think that's necessary just yet - I think if you get reasonably "sharp" images with your camera that we can threshold this stuff out.
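A rough sketch of that kind of threshold, in plain Python so it runs without any imaging library. It assumes an 8-bit grayscale image stored as rows of 0-255 values, with the text darker than the background pattern; the cutoff of 100 and the pixel values are made up for illustration, and a real page would need X tuned by eye.

```python
def threshold(gray, x):
    """Binarize a grayscale image given as rows of 0-255 values.

    Pixels darker than x are treated as ink and become 0 (black);
    everything else, including a lighter background pattern,
    becomes 255 (white).
    """
    return [[0 if px < x else 255 for px in row] for row in gray]


# Toy 2x4 "page": dark text (30), mid-gray pattern (150), light paper (230).
page = [
    [30, 150, 230, 150],
    [230, 30, 150, 30],
]
print(threshold(page, 100))
# [[0, 255, 255, 255], [255, 0, 255, 0]]
```

This only works when the text and the pattern sit in clearly separate brightness ranges, which is why sharp images matter: blur produces in-between pixels along the letter edges that no single X can sort correctly.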

I'll return to the US in two weeks and can help - you might also have a look at the thresholding work going on here - there's a thread about red text/colored text that has a number of fairly advanced techniques for getting this to work.

Re: K'ichee' book digitization

Posted: 01 Mar 2011, 11:37
by daniel_reetz
very quick example - color problems, but approaching something sane. in your case the number one thing is to get sharp images, because blur causes the borders of the letters to blend a little with the background. fortunately, it's not all that difficult.

Re: K'ichee' book digitization

Posted: 01 Mar 2011, 11:54
by Ryan_phx
Using Photoshop's threshold tool, you can get pretty good results. This is the output converted to pdf. This was just a quick demo, so I'm sure you could get better results.

Acrobat's OCR isn't too bad, but it sees "lo" as "10":
20 K'a te b'aa 10 k'uuta chi ri q'ani ama' ak' saqi ama' ak', mi xoto ta chi 10 riib' upa ri utolok' umaske'l, xuya chi k'u 10 jun roq'iyaal keb' roq'iyaal.
21 Te k'u ri' 10, ri q'ani atun ch'ok saqi atun ch'ok, mi xpe chi 10 sin ranima; xtz'itz'ot chi k'u 10 pa ri upurnum k'isiis. xtz'itz'ot chi k'u 10 pa ri upurnum paarki, karaj ne' k'oo 10 pa ri upurnum pa'chaj, k'oo 10 pa ri usook pa ri upache' k .
edit: This is what I get for taking my sweet time--Daniel beat me to it!
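One possible cleanup for the "lo"/"10" confusion, as a sketch only. It assumes every standalone "10" that is not at the start of a line is really "lo", and that a number at the start of a line is a verse number to leave alone; a genuine 10 in the running text would be wrongly changed. The function name `fix_lo` is made up for illustration.

```python
import re


def fix_lo(line):
    """Replace standalone '10' with 'lo', except at the start of the line."""
    # (?<!^) rejects a match at the very start of the line (a verse number);
    # the \b word boundaries leave longer numbers such as '100' or '210' alone.
    return re.sub(r"(?<!^)\b10\b", "lo", line)


print(fix_lo("20 K'a te b'aa 10 k'uuta chi ri q'ani ama' ak'"))
# 20 K'a te b'aa lo k'uuta chi ri q'ani ama' ak'
```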

Re: K'ichee' book digitization

Posted: 01 Mar 2011, 12:26
by daniel_reetz
yeah, but yours is much better. :)

Re: K'ichee' book digitization

Posted: 01 Mar 2011, 15:29
by dvelleman
Those threshold examples look fantastic. Yes, I'll definitely give that a shot.

Re: K'ichee' book digitization

Posted: 01 Mar 2011, 15:52
by spamsickle
Mayan cocktail napkins. Who'd've believed it...