
NLP Tools for e-Gov Applications

 

Introduction to Onomatology at GIST

C-DAC GIST has done extensive work in the domain of Onomatology, or Onomastics. The work described here deals with variants of proper nouns.

As is known, Indian names (of people or places) can be spelled in different ways when written in English. No standard says which spelling of an Indian name is correct: is it ‘Gita’ or is it ‘Geeta’? In fact, there can be no such standard; it is a matter of personal choice.

This poses a huge challenge when searching for a name, be it on the Internet or in a database, or when comparing two names programmatically in e-governance applications, where finding the right person is critical.

C-DAC GIST has been actively working in this field for many decades now and has developed many products to address the different challenges faced by people.

List of Onomastics tools and technologies being developed at GIST:

  1. TransName - Transliteration Solution for Indian names
  2. NameScape
  3. Pin code Generator
  4. Name and Address Matching
  5. NameSearch

TRANSLITERATION OF INDIAN NAMES

C-DAC GIST provides high-quality transliteration of Indian people and place names from English to 28 language/script pairs, viz.

Assamese                 Maithili (Devanagari)                         Sanskrit
Bangla                   Maithili (Tirhuta, also called Mithilakshar)  Santhali
Bodo                     Malayalam                                     Santhali (Ol-Chiki)
Dogri                    Manipuri (Assamese-Bengali)                   Sindhi (Devanagari)
Gujarati                 Manipuri (Meitei Mayek)                       Sindhi (Perso-Arabic)
Hindi                    Marathi                                       Tamil
Kannada                  Modi                                          Telugu
Kashmiri (Devanagari)    Nepali                                        Urdu
Kashmiri (Perso-Arabic)  Oriya
Konkani                  Panjabi

These language/script pairs cover all the 22 Scheduled Languages of India.

Transliteration from English to 28 languages/scripts.

Transliteration from 28 languages/scripts to English.

Transliteration from any of these 28 languages/scripts into other Indian language/script pairs.
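For a subset of Indic scripts, a rough script-to-script conversion can be sketched with plain Unicode arithmetic, because the Unicode blocks of the major Brahmi-derived scripts share an ISCII-based layout. The sketch below is only an illustration under that assumption and is not a description of how TransName works. The names `BLOCK_START` and `shift_script` are hypothetical, and the approach does not cover English input, Perso-Arabic scripts, or letters that the target script lacks.

```python
import unicodedata

# First code point of each Unicode block (each block spans 0x80 code points).
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def shift_script(text, source, target):
    """Move each character of `source` script to the same offset in `target`."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        offset = ord(ch) - src
        if 0 <= offset < 0x80:
            candidate = chr(dst + offset)
            # Keep the original character when the target block has no
            # assigned code point at this offset (e.g. Tamil has no 'ga').
            out.append(candidate if unicodedata.name(candidate, "") else ch)
        else:
            out.append(ch)  # spaces, digits, Latin letters, etc.
    return "".join(out)


print(shift_script("गीता", "devanagari", "gujarati"))  # ગીતા
```

A production system needs far more than this (script-specific spelling conventions, missing letters, and the English side of the problem), which is why the offset trick is shown only as a starting point.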


This tool also provides translation of addresses written in English to Indian languages and vice versa.


The solution is available in the following flavours :

NAMESCAPE (NAME MATCHER)

With the government’s emphasis on linking national databases such as Aadhaar and PAN to each other, linking other databases (e.g. beneficiary data) to Aadhaar, and measures such as linking bank accounts and mobile numbers with Aadhaar, it becomes imperative to have a tool that verifies whether the two names being linked belong to the same entity.

As it is impossible for a human to verify millions of records, a tool is needed to determine whether the two names being linked belong to the same person.

The problem arises because, very often, the name a person enters in one location differs significantly from the name entered in another, whether deliberately or inadvertently. The differences can stem from one or more of the following factors:

  1. Initials in one location vs full name in another
  2. Name variants (e.g. ‘gita’ vs ‘geetha’)
  3. Typos
  4. Concatenated names
  5. Re-ordered names
  6. English in one location vs Indian Language in another

NameScape is such a tool: it helps the verifier decide whether two names refer to the same person by providing a match score. It is very fast and, most importantly, highly accurate, which makes it very useful when linking two names with each other.
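To make the idea of a match score concrete, here is a minimal sketch that handles a few of the factors listed above (initials, re-ordered names, concatenated names, and small spelling differences). It is not NameScape's algorithm: it uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure, the function names `token_score` and `name_score` are hypothetical, and the weight of 0.8 given to a matching initial is an arbitrary assumption.

```python
from difflib import SequenceMatcher


def token_score(a, b):
    """Similarity of two lower-case name tokens, in the range [0, 1]."""
    if a == b:
        return 1.0
    # An initial such as "r" is a weaker match for a token starting with "r".
    if len(a) == 1 or len(b) == 1:
        return 0.8 if a[0] == b[0] else 0.0
    return SequenceMatcher(None, a, b).ratio()


def name_score(name1, name2):
    """Order-insensitive similarity of two full names, in the range [0, 1]."""
    t1 = name1.lower().replace(".", " ").split()
    t2 = name2.lower().replace(".", " ").split()
    if not t1 or not t2:
        return 0.0

    # Concatenated names: compare the names with all spaces removed.
    joined = SequenceMatcher(None, "".join(t1), "".join(t2)).ratio()

    # Re-ordered names: greedily pair each token of the shorter name with
    # its best unused partner in the longer name.
    short, long_ = sorted((t1, t2), key=len)
    remaining = list(long_)
    total = 0.0
    for tok in short:
        best = max(remaining, key=lambda other: token_score(tok, other))
        total += token_score(tok, best)
        remaining.remove(best)
    token_based = total / len(long_)

    return max(token_based, joined)


print(name_score("Geeta Patel", "Patel Geeta"))  # 1.0 (re-ordered)
print(name_score("Geetaben", "Geeta Ben"))       # 1.0 (concatenated)
print(name_score("R. Kumar", "Ramesh Kumar"))    # initial vs full name
print(name_score("Gita", "Geetha"))              # spelling variant
```

A verifier would typically compare the score against a threshold chosen for the application; the sketch leaves that choice open and does not attempt cross-script matching.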

It has also been successfully deployed by Credit Information Bureau to check whether a person applying for a loan is a defaulter.

In fact, it can be used wherever proper nouns are to be matched. For example, a songs/movies portal could deploy it so that users can find a song or movie by typing its name, even when they make typing mistakes.


PIN CODE GENERATOR

This is a unique product, in that it is able to compute the PIN code of any address in India that is written without one. If the PIN code mentioned is incorrect, it is able to predict the correct one.

NAME AND ADDRESS MATCHING

The Name and Address Matcher (NAM) is another product, developed to help find whether two entities are the same based not just on name but on address as well. It differs from NameScape, described above, in that NameScape does one-to-one matching, whereas NAM does many-to-many matching.

Essentially, it helps find duplicates within a name/address database as well as across two name/address databases. In effect, it can also be used to find entities that are present in one database but not in the other, e.g. finding property tax defaulters of a given city, whose names appear in the electricity database but not in the property tax database.
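The cross-database case can be sketched as a fuzzy "anti-join": keep every record of one database that has no sufficiently similar record in the other. The code below is a simplified illustration, not NAM's method. The function names, the sample records, and the 0.85 threshold are all assumptions made for the example, and the similarity measure is Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def normalise(record):
    """Lower-case name and address; sort tokens so word order is ignored."""
    name, address = record
    tokens = (name + " " + address).lower().replace(",", " ").split()
    return " ".join(sorted(tokens))


def is_same_entity(rec_a, rec_b, threshold=0.85):
    """True when two (name, address) records look like the same entity."""
    ratio = SequenceMatcher(None, normalise(rec_a), normalise(rec_b)).ratio()
    return ratio >= threshold


def missing_from(source, target, threshold=0.85):
    """Records in `source` with no sufficiently similar record in `target`."""
    return [
        s for s in source
        if not any(is_same_entity(s, t, threshold) for t in target)
    ]


# Hypothetical sample data for illustration only.
electricity = [
    ("Geeta Patel", "12 MG Road, Pune"),
    ("Ramesh Kumar", "4 Station Road, Nagpur"),
]
property_tax = [
    ("Gita Patel", "12 MG Road Pune"),
]

print(missing_from(electricity, property_tax))
# [('Ramesh Kumar', '4 Station Road, Nagpur')]
```

This compares every record with every other record, which is fine for a small example but would need blocking or indexing to scale to large databases.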

Matching addresses is a very difficult problem. All the issues listed under NameScape apply, and to make matters worse, people sometimes write a locality in one address and a sub-locality in another. Matching such addresses is quite complex.

Also, while matching names, it is entirely possible that one database records the spouse's name whereas the other records the person's own name.

In spite of all these issues, this product gives excellent results.


NAME SEARCH

A name can be stored in many variants in a database, either deliberately or inadvertently. It can be stored as a spelling variant (e.g. ‘Geeta’ or ‘Gita’), or as an initial, or concatenated (‘Geetaben’ instead of ‘Geeta Ben’), or with suffixes/prefixes, or with typos. A stored name can also combine several of these.

Needless to say, finding a name can be a challenge if you do not know the exact spelling stored in the database.

The NameSearch solution has been developed to handle exactly this challenge.

It is script-agnostic, i.e. a name can be stored in Gujarati and you can search for it by entering the name in English, Hindi or any other major Indian script.
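One common way to search without knowing the stored spelling is to index every name under a normalised key, so that spelling variants and concatenated forms collide on the same key. The sketch below illustrates that idea for names written in English only. It is not the NameSearch implementation, the rewrite rules in `variant_key` are toy assumptions, and the helper names `build_index` and `search` are hypothetical.

```python
import re
from collections import defaultdict


def variant_key(name):
    """Collapse common romanisation variants into one lookup key (toy rules)."""
    key = re.sub(r"[^a-z]", "", name.lower())        # drop spaces, dots, digits
    key = key.replace("ee", "i").replace("oo", "u")  # long-vowel spellings
    key = re.sub(r"(?<=[bdgjkpt])h", "", key)        # 'th' -> 't', 'dh' -> 'd'
    key = re.sub(r"(.)\1+", r"\1", key)              # collapse doubled letters
    return key


def build_index(names):
    """Map each variant key to the stored names that share it."""
    index = defaultdict(list)
    for stored in names:
        index[variant_key(stored)].append(stored)
    return index


def search(index, query):
    """Return stored names whose key equals the key of the query."""
    return index.get(variant_key(query), [])


# Hypothetical stored names for illustration only.
index = build_index(["Geeta Ben", "Gita Patel", "Geetha", "Ramesh"])

print(search(index, "Gita"))         # ['Geetha']
print(search(index, "Geetaben"))     # ['Geeta Ben']
print(search(index, "Githa Patel"))  # ['Gita Patel']
```

An exact key lookup like this does not tolerate typos, initials, or input in another script; those would need approximate matching and transliteration on top of the index.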


For more details, please contact:

More information on GIST products
E-Mail:
info.gist@cdac.in

Sales related information
E-Mail:
sales.gist@cdac.in

Support related information
E-Mail:
support.gist@cdac.in