Google's Tamil and Sinhala OCR tool

 

Introduction

An Optical Character Recognition (OCR) tool converts printed or handwritten text into editable text. Tamil and Sinhala have been challenging scripts for OCR tools. Although several research efforts have been carried out, we still do not have a product that gives reportable results. Popular tools, including Tesseract, that have been used successfully for Latin-based scripts give poor results for Tamil and Sinhala.

In recent times Google has been working on many language tools, including machine translation and OCR. In May 2015 Google released an OCR feature as part of Google Drive (drive.google.com), which can be used to OCR Tamil and Sinhala text.

Google states that its OCR engine recognizes images (.jpg, .png, and .gif files) and multi-page PDF documents (.pdf, maximum 10 pages).

Supported scripts

Acehnese, Acholi, Adangme, Afrikaans, Akan, Albanian, Algonquinian, Amharic, Ancient Greek, Arabic (Modern Standard), Araucanian/Mapuche, Armenian, Assamese, Asturian, Athabaskan, Aymara, Azerbaijani, Azerbaijani (Cyrillic; old orthography), Balinese, Bambara, Bantu, Bashkir, Basque, Batak, Belorussian, Bemba, Bengali, Bikol, Bislama, Bosnian, Breton, Bulgarian, Burmese, Catalan, Cebuano, Chechen, Cherokee, Chinese (Mandarin; Hong Kong), Chinese (Simplified; Mandarin), Chinese (Traditional; Mandarin), Choctaw, Chuvash, Cree, Creek, Crimean Tatar, Croatian, Czech, Dakota, Danish, Dhivehi, Duala, Dutch, Dzongkha, Efik, English (American), English (British), Esperanto, Estonian, Ewe, Faroese, Fijian, Filipino, Finnish, Fon, French (Canadian), French (European), Fulah, Ga, Galician, Ganda, Gayo, Georgian, German, Gilbertese, Gothic, Greek, Guarani, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Herero, Hiligaynon, Hindi, Hungarian, Iban, Icelandic, Igbo, Iloko, Indonesian, Irish, Italian, Japanese, Javanese, Kabyle, Kachin, Kalaallisut, Kamba, Kannada, Kanuri, Kara-Kalpak, Kazakh, Khasi, Khmer, Kikuyu, Kinyarwanda, Kirghiz, Komi, Kongo, Korean, Kosraean, Kuanyama, Lao, Latin, Latvian, Lingala, Lithuanian, Low German, Lozi, Luba-Katanga, Luo, Macedonian, Madurese, Malagasy, Malay, Malayalam, Maltese, Mandingo, Manx, Maori, Marathi, Marshallese, Mende, Middle English, Middle High German, Minangkabau, Mohawk, Mongo, Mongolian, Nahuatl, Navajo, Ndonga, Nepali, Niuean, North Ndebele, Northern Sotho, Norwegian (Bokmål), Nyanja, Nyankole, Nyasa Tonga, Nzima, Occitan, Ojibwa, Old English, Old French, Old High German, Old Norse, Old Provencal, Oriya, Ossetic, Pampanga, Pangasinan, Papiamento, Pashto, Persian, Polish, Portuguese (Brazilian), Portuguese (European), Punjabi (Gurmukhi), Quechua, Romanian, Romansh, Romany, Rundi, Russian, Russian (Old Orthography), Sakha, Samoan, Sango, Sanskrit, Scots, Scottish Gaelic, Serbian (Cyrillic), Serbian (Latin), Shona, Sinhala, Slovak, Slovenian, Songhai, Southern Sotho, Spanish (European), Spanish (Latin American), Sundanese, Swahili, Swati, Swedish, Tahitian, Tajik, Tamil, Tatar, Telugu, Temne, Thai, Tibetan, Tigrinya, Tongan, Tsonga, Tswana, Turkish, Turkmen, Udmurt, Ukrainian, Urdu, Uzbek, Uzbek (Cyrillic; old orthography), Venda, Vietnamese, Votic, Welsh, Western Frisian, Wolof, Xhosa, Yiddish, Yoruba, Zapotec, and Zulu.

Steps to do OCR using Google's Engine

  • Upload your image or PDF to Google Drive. (Make sure the file size is not more than 2 MB.)
  • Right-click on the uploaded file in Google Drive.
  • Choose 'Open with' -> Google Docs.
  • You are done! You will see the image as well as the recognized text, which is editable and encoded in Unicode.
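
Before uploading, it can save a round trip to check a file against the limits mentioned above. The following is only a minimal sketch: the function name upload_problems is hypothetical, the extension list and the 2 MB figure are taken from this post, and treating 2 MB as 2 * 1024 * 1024 bytes is my assumption. It does not check the 10-page limit for PDFs, since that would need a PDF library.

```python
import os

# Formats and size limit as described in this post; not an official API.
SUPPORTED_EXTENSIONS = {".jpg", ".png", ".gif", ".pdf"}
MAX_SIZE_BYTES = 2 * 1024 * 1024  # assumption: "2 MB" read as 2 MiB


def upload_problems(filename, size_bytes):
    """Return a list of reasons the file may not be accepted for OCR.

    An empty list means no problem was found by these two checks.
    The PDF page count (max 10 pages) is NOT checked here.
    """
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append("unsupported file type: " + (extension or "(none)"))
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("file is larger than 2 MB")
    return problems


if __name__ == "__main__":
    # Example with made-up file names and sizes.
    print(upload_problems("sample1.png", 500_000))
    print(upload_problems("scan.tiff", 3 * 1024 * 1024))
```

For a real file, the size can be obtained with os.path.getsize(path) and passed as size_bytes.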

 

Personal Observations

I tested the tool with the samples given below. 

For Tamil script, it worked perfectly for Sample 1 and Sample 2; I could not see any mistakes. By my own reading, the recognition rate was ~100%. For Sample 3, which is an extract from an old book, the results were not perfect, but not too bad either. It struggles with Tamil numerals and old forms of Tamil letters.

For Sinhala, I captured a screenshot of Silumina (Sample 4) and tested it. I saw a few recognition issues, but it worked.

However, the recognition rate for PDFs was very poor. I fed Sample 2 both as a PDF and as a PNG. The PNG's recognition was ~100%, but the output for the PDF was not usable at all.
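
The rates quoted above come from reading the output by eye. If a hand-typed reference text is available, one way to put a number on it is a character-level accuracy based on edit distance. This is a sketch under my own assumptions, not the method used for the observations above: the names edit_distance and recognition_rate are hypothetical, and characters are counted as Unicode code points after NFC normalization, so a Tamil or Sinhala letter with a vowel sign counts as more than one character.

```python
import unicodedata


def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def recognition_rate(reference, ocr_output):
    """Character accuracy in [0, 1]: 1 - distance / len(reference)."""
    ref = unicodedata.normalize("NFC", reference)
    out = unicodedata.normalize("NFC", ocr_output)
    if not ref:
        return 1.0 if not out else 0.0
    return max(0.0, 1.0 - edit_distance(ref, out) / len(ref))


if __name__ == "__main__":
    # Toy strings for illustration only; not taken from the samples.
    print(recognition_rate("abcd", "abxd"))
```

Running the same function on the PNG output and the PDF output of one sample, against the same reference text, would make the gap between the two comparable as numbers.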

 

Sample 1

 

Sample 2

 

Sample 3

 

Sample 4

 

For more information:

https://support.google.com/drive/answer/176692?hl=en

http://googleresearch.blogspot.com/2015/05/paper-to-digital-in-200-languages.html