In this digital age, information needs to be available in a digital form that machines can read. In a country like India, an abundance of information exists in manuscripts, ancient texts, books and other material that is traditionally available only in printed or handwritten form. Such material is inadequate when it comes to searching for information across thousands of pages. It has to be digitized and converted to text so that it can be processed by machines that search millions of pages per second. Only then will the knowledge of Indian history, tradition and culture be available to the masses, and only then can the digital revolution be said to have reached the information age.

Optical character recognition plays an important role in achieving this. It converts the scanned images of books, magazines, and newspapers into machine-readable text.

Almost all Indian scripts are cursive in nature, which makes them hard for machines to recognize. Scripts such as Devanagari, Gujarati, Bengali and many others have conjuncts (joint characters), which increase segmentation difficulties. In addition, the variety of fonts and sizes used in printing over the years, the quality of the paper, the scanning resolution, images embedded in the text and similar factors make for a challenging image processing job. Post-processing also requires extensive linguistic know-how. The diagram shows the basic building blocks of an OCR system.

GIST Research Labs is committed to applying all of the image processing skills and linguistic know-how it has gathered over the years to developing a highly accurate optical character recognition engine.

[Figure: OCR block diagram]

C-DAC GIST Lab's research seeks to develop an optical character recognition engine that achieves the highest levels of accuracy in converting images of Indian language text to machine-readable text. The basic OCR for the Devanagari script, named 'Chitrankan', can be found in its product portfolio.


All these noises contribute to a decrease in the accuracy of an OCR system. As a result, having a noise correction routine in place becomes essential.
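As one illustration of such a routine, the sketch below applies a 3x3 median filter, a standard technique for removing isolated specks (salt-and-pepper noise) from a scanned page. It is a minimal example under assumptions, not the routine used in C-DAC's engine: the function name `median_filter_3x3` and the representation of the page as a list of rows of grayscale values (0 = black, 255 = white) are chosen here for illustration only.

```python
def median_filter_3x3(image):
    """Replace each pixel with the median of its 3x3 neighbourhood.

    `image` is a list of rows of grayscale integers (assumed layout).
    Neighbourhoods are clipped at the page border, so border pixels use
    fewer than nine values.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the pixel and its in-bounds neighbours.
            window = sorted(
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            )
            # Middle element; for even-sized windows this is the upper median.
            row.append(window[len(window) // 2])
        result.append(row)
    return result


# A white page with a single black speck in the middle.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
cleaned = median_filter_3x3(page)
```

A median filter removes specks smaller than its window, but it can also erode thin strokes and small dots that belong to the text, so the window size has to be chosen with the script and the scanning resolution in mind.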



[Figure: bilingual text]

Bi-Lingual Nature of Text

In a country like India, which has been influenced by several European languages such as English, French and Portuguese, text that mixes in these languages is inevitable.

Apart from the European languages, India itself has twenty-two official languages, any of which may also be found embedded in the text.
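To make the mixed-script problem concrete, the sketch below splits already-recognized text into runs by script, using Unicode block ranges. It is only an illustration: an OCR engine has to make this decision on the image, before recognition, which is considerably harder. The names `INDIC_BLOCKS`, `script_of` and `split_by_script` are hypothetical, and only three Indic blocks are listed.

```python
# Unicode block ranges for a few Indic scripts (inclusive code points).
INDIC_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gujarati": (0x0A80, 0x0AFF),
}


def script_of(ch):
    """Return the script name for one character, or None if unknown."""
    code = ord(ch)
    for name, (low, high) in INDIC_BLOCKS.items():
        if low <= code <= high:
            return name
    # Letters up to Latin Extended-B are treated as Latin (assumption).
    if ch.isalpha() and code <= 0x024F:
        return "Latin"
    return None


def split_by_script(text):
    """Group whitespace-separated words into consecutive same-script runs."""
    runs = []
    for word in text.split():
        # A word takes the script of its first recognizable character.
        script = next((s for s in map(script_of, word) if s), None)
        if runs and (script is None or runs[-1][0] == script):
            # Words without a recognizable script join the current run.
            runs[-1][1].append(word)
        else:
            runs.append((script, [word]))
    return [(script, " ".join(words)) for script, words in runs]


runs = split_by_script("भारत India ગુજરાત")
```

Each run could then be handed to a recognizer or post-processor for that script, which is one reason script identification matters for bilingual pages.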

[Figure: OCR design]

C-DAC has developed novel character and pattern segmentation methods that have been applied for the first time to solve the above-mentioned problem of Devanagari OCR.
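C-DAC's segmentation methods are not described here. For orientation only, the sketch below shows the classical baseline they would be compared against: splitting a binarized word image wherever a column contains no ink (a vertical projection profile). The function name `split_on_blank_columns` and the bitmap layout (rows of 0/1 values, 1 = ink) are assumptions made for this example.

```python
def split_on_blank_columns(bitmap):
    """Return (start, end) column ranges, end exclusive, that contain ink.

    `bitmap` is a list of equal-length rows of 0/1 values, where 1 marks
    an ink pixel (assumed layout). Ranges are separated by blank columns.
    """
    if not bitmap:
        return []
    width = len(bitmap[0])
    # Vertical projection profile: number of ink pixels in each column.
    profile = [sum(row[x] for row in bitmap) for x in range(width)]
    segments = []
    start = None
    for x, ink in enumerate(profile):
        if ink and start is None:
            start = x                      # a new segment begins
        elif not ink and start is not None:
            segments.append((start, x))    # a blank column ends it
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


# Two glyphs separated by a blank column.
separate = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
]
# The same glyphs joined by a continuous top stroke.
joined = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
]
```

On `separate` the function finds two segments, while on `joined` it finds only one. In Devanagari the characters of a word are connected by a headline stroke, and conjuncts fuse characters further, so blank-column splitting alone cannot separate them.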

