K'ichee' book digitization

dvelleman

K'ichee' book digitization

Post by dvelleman »

I'm building a scanner to help digitize some books in K'ichee', one of the modern Mayan languages of Guatemala. Until recently I'd been working with a flatbed scanner, but my patience is starting to run out...

Here's one of the successes --- this is from a history of the Catholic church in Guatemala, scanned on a flatbed, with very nice despeckling and pretty okay deskewing from a recent beta version of Scan Tailor.
500.png
And here's a tougher case. This is from Tzonob'al Tziij, a collection of speeches that used to be given during the preparations for a K'ichee' wedding. That fancy background image is based on a traditional weaving pattern. It's pretty, but man, it confuses the bejeezus out of Scan Tailor. I took this with an overhead camera but no platen. I'm hoping that adding a platen will give me a straighter image and let me skip Scan Tailor altogether (though it may be that OCRing these pages will be just as hopeless).

The three dots in the top right corner are the page number in hieroglyphic numerals. It's a base-20 system: two dots and then one dot means (2x20)+1=41. Nobody does everyday math with the old numerals anymore, but they're catching on for Serious Use in Documents Of Cultural Significance --- rather like Roman numerals in Europe.
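The base-20 arithmetic above can be sketched in a few lines. This is only an illustration: the function names `maya_to_int` and `int_to_maya` are made up for this example, each digit is given as a plain integer from 0 to 19 (most significant first), and it assumes the pure base-20 counting used for page numbers.

```python
def maya_to_int(digits):
    """Convert base-20 digits (most significant first) to an integer."""
    value = 0
    for d in digits:
        if not 0 <= d < 20:
            raise ValueError(f"digit out of range for base 20: {d}")
        value = value * 20 + d
    return value


def int_to_maya(n):
    """Convert a non-negative integer to base-20 digits (most significant first)."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    digits = []
    while True:
        n, remainder = divmod(n, 20)
        digits.append(remainder)
        if n == 0:
            break
    return digits[::-1]


# The page number from the post: two dots, then one dot.
print(maya_to_int([2, 1]))  # 41
print(int_to_maya(41))      # [2, 1]
```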
tzonobal.png
This is the colophon from the same collection of speeches. It's the date when the print run ended, put into the Maya calendar and written out in hieroglyphs. (The last page of the book has the date the print run began, written out the same way.) The calendar is still in occasional use --- not for everyday timekeeping, but in traditional astrology and medicine, which never entirely died out. But really, again, this is more like a Latin inscription in an English book: whether or not you can read it yourself, it gives the whole project an air of gravitas.
colophon.png
StevePoling
Posts: 290
Joined: 20 Jun 2009, 12:19
E-book readers owned: SONY PRS-505, Kindle DX
Number of books owned: 9999
Location: Grand Rapids, MI

Re: K'ichee' book digitization

Post by StevePoling »

since that background is a repeating pattern, couldn't you do some kind of 2D blind deconvolution to subtract it out?
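One way to picture "subtracting out" a repeating background is sketched below. Note that this is not blind deconvolution; it is a much simpler stand-in that estimates one background tile by taking the median at each position within the repeat. It assumes the repeat size in pixels (`period_y`, `period_x`) is already known and that the pattern lines up with the pixel grid, which a skewed camera shot would not satisfy without deskewing first. The function name and the `tol` default are hypothetical.

```python
from statistics import median


def remove_periodic_background(img, period_y, period_x, tol=40):
    """Return a black/white image (0 = ink, 255 = paper).

    img is a list of rows of grayscale values (0-255). The background tile
    is estimated as the median value at each phase of the repeat; a pixel
    counts as ink when it is more than `tol` darker than that estimate.
    """
    # Group pixel values by their position inside the repeating tile.
    buckets = {}
    for y, row in enumerate(img):
        for x, v in enumerate(row):
            buckets.setdefault((y % period_y, x % period_x), []).append(v)

    # The median is robust to the minority of pixels covered by text.
    tile = {phase: median(vals) for phase, vals in buckets.items()}

    return [
        [0 if tile[(y % period_y, x % period_x)] - v > tol else 255
         for x, v in enumerate(row)]
        for y, row in enumerate(img)
    ]


# Tiny synthetic page: a 2x2 checker pattern with one dark "ink" pixel.
page = [[200 if (x + y) % 2 == 0 else 150 for x in range(8)] for y in range(4)]
page[1][3] = 20
clean = remove_periodic_background(page, 2, 2)
print(clean[1])  # only the ink pixel at column 3 is 0
```

The median only works as a background estimate when text covers less than half of the samples at each tile position.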
daniel_reetz
Posts: 2812
Joined: 03 Jun 2009, 13:56
E-book readers owned: Used to have a PRS-500
Number of books owned: 600
Country: United States

Re: K'ichee' book digitization

Post by daniel_reetz »

If you can get sharper images, more like the first page with the hatched pattern, we can help you design a thresholding operation to extract the hatches. The basic formula would be to say "all pixels below X should be black" and "all pixels above X should be white" and go with that. Steve also has a point that it can be treated in frequency space, though I don't think that's necessary just yet. I think if you get reasonably "sharp" images with your camera, we can threshold this stuff out.
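A fixed cutoff of this kind might look like the sketch below. It assumes a grayscale image stored as a list of rows with values from 0 to 255, and that the text is darker than the background pattern, so anything under the cutoff becomes black and everything else becomes white. The function name and the sample cutoff are placeholders; the right X has to be found by trying values on the actual scans.

```python
def threshold(img, cutoff):
    """Map every pixel darker than `cutoff` to black (0), the rest to white (255)."""
    return [[0 if v < cutoff else 255 for v in row] for row in img]


# Dark text (about 30) on a mid-gray pattern (about 150) over light paper (about 230).
sample = [
    [230, 150, 30, 150],
    [150, 30, 30, 230],
]
print(threshold(sample, 100))
# [[255, 255, 0, 255], [255, 0, 0, 255]]
```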

I'll return to the US in two weeks and can help. You might also have a look at the thresholding work going on here: there's a thread about red text/colored text that has a number of fairly advanced techniques for getting this to work.
daniel_reetz

Re: K'ichee' book digitization

Post by daniel_reetz »

very quick example - color problems, but approaching something sane. in your case the number one thing is to get sharp images, because blur causes the borders of the letters to blend a little with the background. fortunately, it's not all that difficult.
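For anyone without Photoshop, a levels-style adjustment can be approximated as below. This is a rough sketch of the general idea (pick a black point and a white point, then stretch everything in between), not a reproduction of Photoshop's tool; the function name and the sample points are made up. It also shows why blur hurts: blurred letter edges land between the two points and come out gray instead of cleanly black or white.

```python
def levels(img, black, white):
    """Stretch grayscale values so `black` maps to 0 and `white` maps to 255.

    Values at or below `black` are clipped to 0, values at or above `white`
    are clipped to 255, and values in between are scaled linearly.
    """
    if white <= black:
        raise ValueError("white point must be greater than black point")
    scale = 255 / (white - black)
    return [
        [max(0, min(255, round((v - black) * scale))) for v in row]
        for row in img
    ]


row = [[0, 60, 90, 180, 255]]
print(levels(row, 60, 180))  # [[0, 0, 64, 255, 255]]
```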
Attachments
tzonobal_quickly_done_with_levels_tool_in_photoshop.png
Ryan_phx
Posts: 63
Joined: 29 Dec 2010, 14:51
E-book readers owned: Nook, Kindle DX
Number of books owned: 0
Country: USA
Location: Sandusky, OH

Re: K'ichee' book digitization

Post by Ryan_phx »

Using Photoshop's threshold tool, you can get pretty good results. This is the output converted to pdf. This was just a quick demo, so I'm sure you could get better results.

Acrobat's OCR isn't too bad, but it sees "lo" as "10":
20 K'a te b'aa 10 k'uuta chi ri q'ani ama' ak' saqi ama' ak', mi xoto ta chi 10 riib' upa ri utolok' umaske'l, xuya chi k'u 10 jun roq'iyaal keb' roq'iyaal.
21 Te k'u ri' 10, ri q'ani atun ch'ok saqi atun ch'ok, mi xpe chi 10 sin ranima; xtz'itz'ot chi k'u 10 pa ri upurnum k'isiis. xtz'itz'ot chi k'u 10 pa ri upurnum paarki, karaj ne' k'oo 10 pa ri upurnum pa'chaj, k'oo 10 pa ri usook pa ri upache' k .
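If the "lo" read as "10" confusion turns out to be consistent, a small post-processing pass over the OCR text could patch it. The sketch below is a heuristic with loud assumptions: it treats a number at the very start of a line as a verse number and leaves it alone, and it rewrites any standalone "10" later in the line to "lo". A genuine numeral 10 in the middle of a line would be changed by mistake, so the output still needs proofreading. The function name is hypothetical.

```python
import re

# A standalone "10" that has a word and a space before it (so it is not
# the leading verse number) and no digit right after it.
_LO_MISREAD = re.compile(r"(?<=\S\s)10(?!\d)")


def fix_lo(line):
    """Replace mid-line '10' tokens with 'lo', keeping a leading verse number."""
    return _LO_MISREAD.sub("lo", line)


print(fix_lo("20 K'a te b'aa 10 k'uuta chi ri q'ani ama' ak'"))
# 20 K'a te b'aa lo k'uuta chi ri q'ani ama' ak'
```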
edit: This is what I get for taking my sweet time--Daniel beat me to it!
Attachments
tzonobal copy.pdf
daniel_reetz

Re: K'ichee' book digitization

Post by daniel_reetz »

yeah, but yours is much better. :)
dvelleman

Re: K'ichee' book digitization

Post by dvelleman »

Those threshold examples look fantastic. Yes, I'll definitely give that a shot.
spamsickle
Posts: 596
Joined: 06 Jun 2009, 23:57

Re: K'ichee' book digitization

Post by spamsickle »

Mayan cocktail napkins. Who'd've believed it...