NLP and Resource Creation Tools

Natural Language Processing Resources and Resource Creation Tools

C-DAC GIST has always been at the forefront of the development of new tools and technologies in the area of Natural Language Processing (NLP). This tradition of cutting-edge technologies is continually upheld at the GIST Labs where new tools compatible with the needs and requirements of today's fast developing digital world are being developed. And at the root of all these technologies are NLP Resources.

Dictionaries and other NLP Resources are valuable databases in a country like India, where Cross-Lingual Information Querying systems are urgently needed. They are also needed in areas such as E-Governance, Teaching Systems and Search Engines. GIST has started developing dictionaries in collaboration with the Language Boards and Academies of each linguistic region. A dictionary database can be monolingual or bilingual, or it can be a dictionary of synonyms, antonyms or idiomatic expressions common to the language.

Since dictionaries are often made by hand using traditional indexes, a dictionary validation and building tool has been created to ensure that the dictionaries are properly indexed and that the maximum information within the dictionary is retrievable.

List of NLP Resources being developed at GIST:

Spellchecker dictionaries
Corpus
Synonym Dictionaries, Antonym Dictionaries
Verbnet
Online Thesauri
Visual Thesaurus
CLDR

List of NLP Resource Creation Tools being developed at GIST:

Concordance
Dictionary Tagging Tool
Thesaurus And Dictionary Building Tools
Gist Synonym Builder
Thesaurus Generation Tool

Spellchecker Dictionaries

GIST has manually curated spellchecker dictionaries for 14 languages: Assamese, Bangla, Bodo, Gujarati, Hindi, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Tamil, Telugu and Urdu.

Every word also carries PoS (part-of-speech) information. At the same time, the dictionaries are highly compressed, so they take up very little disk space.

CORPUS

GIST has developed a large text corpus for most Indian languages. This is a cleaned corpus running into millions of words.

It is continually being updated with text on topics such as daily news, philosophy, poetry, literature, advertisements, general knowledge, current affairs, etc.

VERBNET

GIST is working on an exciting project to develop a Verbnet for Indian Languages, taking its cue from the Verbnet developed for English.

A Verbnet is a lexical resource that incorporates both syntactic and semantic information about verbs. This information would be useful for higher-level NLP applications such as Grammar Checking, Translation, etc.

Currently it is being developed for Bengali, Malayalam, Tamil and Telugu.

SYNONYM DICTIONARIES, ANTONYM DICTIONARIES

Synonym and Antonym dictionaries are being created for some Indian Languages.

ONLINE THESAURI

Thesauri, which provide much more semantic information than dictionaries, are a vital tool for search engines, data mining and information retrieval. GIST has started work on a Thesaurus Building Engine, which will ensure that the structure of the thesaurus, with its hyponyms and hypernyms, is correctly indexed, permitting fast information retrieval.
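As a rough illustration of what indexing hyponyms and hypernyms can mean in practice, the sketch below keeps one lookup table per direction so that either relation can be followed without scanning the whole thesaurus. The class name ThesaurusIndex, its methods and the English sample words are all assumptions made for this example; they do not describe the actual Thesaurus Building Engine.

```python
from collections import defaultdict


class ThesaurusIndex:
    """Hypothetical index: one dictionary per direction of the relation."""

    def __init__(self):
        self.hypernyms = defaultdict(set)  # narrower term -> broader terms
        self.hyponyms = defaultdict(set)   # broader term -> narrower terms

    def add(self, narrower, broader):
        """Record that `narrower` is a hyponym of `broader`."""
        self.hypernyms[narrower].add(broader)
        self.hyponyms[broader].add(narrower)

    def all_hyponyms(self, term):
        """Collect every term below `term`, at any depth."""
        found, stack = set(), [term]
        while stack:
            # .get avoids creating empty entries for leaf terms
            for child in self.hyponyms.get(stack.pop(), ()):
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return found


# Illustrative sample data only
index = ThesaurusIndex()
index.add("dog", "animal")
index.add("cat", "animal")
index.add("puppy", "dog")

print(sorted(index.all_hyponyms("animal")))
print(sorted(index.hypernyms["puppy"]))
```

Storing the relation twice costs some memory, but it lets a search engine expand a query to narrower or broader terms with simple dictionary lookups.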

GIST VISUAL THESAURUS

GIST Visual Thesaurus presents a thesaurus-based word map for the entered word in an interactive and easily explorable manner. Its unique and attractive graphical visualization of the word map and word net makes it an easy tool to use and encourages learning. It accepts word input in Unicode for Indian languages. Currently Hindi and Gujarati are supported. It is aimed at anyone who wants to find the right word for an idea, and at those who want to explore and learn a language.

CLDR (Common Locale Data Repository)

CLDR is the largest and most extensive standard repository of locale data. This data is used for software internationalization and localization, i.e. adapting software to the conventions of different languages. Indian language data can be entered for common software tasks such as formatting dates, times, time zones, numbers and currency values; sorting text; choosing languages or countries by name; and many other categories. This creates a Unicode-compliant linguistic resource for the eventual development of high-end NLP tools and technologies.

Features:

It is a desktop application.
Best suited for offline data entry where bandwidth issues are of major concern.
It reads the English CLDR file (an XML file, Unicode compliant, downloaded from the Unicode CLDR site) from disk.
The GUI keeps the English CLDR data in front of the user as a ready reference while the native language data is being entered.
GUI: a grid displays the English CLDR data in one column, and the data for the language whose CLDR is being created can be entered in the other column.
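The two-column grid described above can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-written excerpt in the LDML layout used by CLDR locale files; the function names reference_rows and build_grid, the row keys and the sample Hindi values are invented for the example and are not taken from the GIST application.

```python
import xml.etree.ElementTree as ET

# Tiny hypothetical excerpt in the LDML layout used by CLDR locale files.
ENGLISH_LDML = """<ldml>
  <localeDisplayNames>
    <languages>
      <language type="hi">Hindi</language>
      <language type="gu">Gujarati</language>
    </languages>
    <territories>
      <territory type="IN">India</territory>
    </territories>
  </localeDisplayNames>
</ldml>"""


def reference_rows(ldml_text):
    """Flatten an LDML document into (key, English value) rows."""
    root = ET.fromstring(ldml_text)
    rows = []
    for element in root.iter():
        code = element.get("type")
        text = (element.text or "").strip()
        if code and text:
            rows.append((f"{element.tag}/{code}", text))
    return rows


def build_grid(rows, translations):
    """Pair each English value with the native value entered so far, if any."""
    return [(key, english, translations.get(key, "")) for key, english in rows]


# Hypothetical values entered by a user; one cell is still empty.
entered = {"language/hi": "हिन्दी", "territory/IN": "भारत"}
grid = build_grid(reference_rows(ENGLISH_LDML), entered)
for key, english, native in grid:
    print(f"{key:<14} {english:<10} {native}")
```

For a file on disk, ET.parse(path).getroot() would take the place of ET.fromstring. Keeping the English value in every row is what gives the person entering data a ready reference.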

CONCORDANCE / CORPUS ANALYZER

GIST has developed a Concordance that is a great help for anyone (especially linguists) who wants to analyse the behaviour of a language or the patterns found in it. The user supplies the text corpora to the tool.

The tool can then find all the contexts in which every n-gram of the corpus occurs. Optionally, one can search for just a particular word or phrase (partial or whole) across the entire corpus, and its contexts are then shown.
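To make the idea of n-grams and their contexts concrete, here is a minimal keyword-in-context sketch. It assumes whitespace tokenization and exact whole-phrase matching, which is far simpler than what a real concordancer needs; the function names and the one-line English corpus are illustrative only.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def concordance(tokens, phrase, width=3):
    """Return (left context, match, right context) for each occurrence of phrase."""
    n = len(phrase)
    hits = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + n:i + n + width])
            hits.append((left, " ".join(phrase), right))
    return hits


# Hypothetical one-line corpus; whitespace splitting is a simplification.
corpus = "the cat sat on the mat and the cat slept".split()

for left, match, right in concordance(corpus, ["the", "cat"]):
    print(f"{left:>15} [{match}] {right}")
```

Calling ngrams(corpus, 2) lists every two-word sequence in the corpus, and each of those could be passed to concordance in turn to show all of its contexts.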

It can find not just words/phrases, but also patterns based on Parts of Speech. E.g. it can help you find all occurrences of a Noun followed by a Verb, or all occurrences of Adjectives ending with particular characters followed by a Noun. In fact, any such pattern can be searched.
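A part-of-speech pattern search of this kind can be sketched as a sliding window over tagged tokens. The example assumes the corpus is already tagged as (word, tag) pairs; the tag names, helper functions and sample sentence are hypothetical and do not reflect the tool's own tagset or query syntax.

```python
def tag_is(tag):
    """Build a predicate matching any token that carries the given PoS tag."""
    return lambda word, pos: pos == tag


def tag_with_suffix(tag, suffix):
    """Build a predicate matching tokens with the given tag and word ending."""
    return lambda word, pos: pos == tag and word.endswith(suffix)


def find_pattern(tagged, pattern):
    """Return every run of tokens whose items satisfy the predicates in order."""
    n = len(pattern)
    matches = []
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if all(pred(word, pos) for pred, (word, pos) in zip(pattern, window)):
            matches.append([word for word, _ in window])
    return matches


# Hypothetical hand-tagged sentence; the tag names are illustrative.
tagged = [("playful", "ADJ"), ("children", "NOUN"), ("sing", "VERB"),
          ("cheerful", "ADJ"), ("songs", "NOUN"), ("loudly", "ADV")]

# Noun followed by a Verb
noun_verb = find_pattern(tagged, [tag_is("NOUN"), tag_is("VERB")])
# Adjective ending in "ful" followed by a Noun
adj_ful_noun = find_pattern(tagged, [tag_with_suffix("ADJ", "ful"), tag_is("NOUN")])

print(noun_verb)
print(adj_ful_noun)
```

Because each position in the pattern is just a predicate, new conditions (a prefix, a word list, a combination of tags) can be added without changing find_pattern.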

It has built-in NLP tools such as a spell-checker, grammar-checker, syntactic parser, etc., which help in finding grammatical and typing errors within the corpora. There is also a very easy way of editing and updating the corpora.

DICTIONARY TAGGING TOOL

Dictionary Tagging Tool is language resource development software developed by GIST, C-DAC Pune. Its intended end users are linguists. At present the software supports Hindi. It is a database-backed tool that allows a user to enter a word with all its grammatical details, such as Etymology, Class, Gender, Denotative Meaning, Connotative Meaning, Domain Based Meaning, Collocations, etc.

The present version of Dictionary Tagging Tool comes with a noun tool.

Need for Dictionary Tagging Tool

India is rich in languages. With the spread of the internet in India, the internet must also become rich in Indian languages so that it can reach the masses. Many language experts are doing their best to make their knowledge available to the IT world. Software like the Dictionary Tagging Tool is needed so that these experts can share their knowledge and help build a rich database of Indian languages. The tool is web based, so any authorized remote user with internet access can use it.

This software can be used for creating a dictionary.

Features:

Allows for creation of a dictionary database
The main user would be the Linguist community interested in creating an on-line dictionary
The dictionary can be created on-line by a user from his machine.
The data is tagged for the following areas:
1. Head Word
2. Etymology
3. IPA (Automatic Conversion)
4. Grammatical Information in the shape of tags.
5. Semantic categories : Denotation and Connotation.
User can add extra tags in case the existing tags are insufficient.
Grammatical categories for Nouns are automatically generated. Typing a noun generates output arranged according to its inflexional forms.
The data is secure at the user level: data created by one user can be viewed but not modified by another user.
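The tagged fields and the user-level protection listed above can be modelled roughly as follows. This is a sketch under stated assumptions: the Entry and Dictionary classes, the field names, the user names and the sample Hindi entry are invented for illustration, and the real tool stores its data in a database rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One dictionary record; field names are assumptions based on the list above."""
    head_word: str
    owner: str                      # user who created the entry
    etymology: str = ""
    ipa: str = ""
    grammar_tags: list = field(default_factory=list)
    denotation: str = ""
    connotation: str = ""
    extra_tags: dict = field(default_factory=dict)  # user-defined tags


class Dictionary:
    """Hypothetical in-memory store keyed by head word."""

    def __init__(self):
        self._entries = {}

    def add(self, entry):
        self._entries[entry.head_word] = entry

    def view(self, head_word):
        """Any user may read an entry."""
        return self._entries[head_word]

    def update(self, user, head_word, **changes):
        """Only the user who created an entry may modify it."""
        entry = self._entries[head_word]
        if entry.owner != user:
            raise PermissionError(f"{user} cannot modify an entry owned by {entry.owner}")
        for name, value in changes.items():
            if not hasattr(entry, name):
                raise AttributeError(f"unknown field: {name}")
            setattr(entry, name, value)


# Hypothetical users and entry
store = Dictionary()
store.add(Entry(head_word="पुस्तक", owner="asha", grammar_tags=["noun", "feminine"]))
store.update("asha", "पुस्तक", denotation="book")

try:
    store.update("ravi", "पुस्तक", denotation="volume")
except PermissionError as error:
    print("rejected:", error)

print(store.view("पुस्तक").denotation)
```

The extra_tags dictionary stands in for the ability to add tags when the existing ones are insufficient.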

Thesaurus And Dictionary Building Tools

In NLP, thesauri and dictionaries serve as major databases for various activities. They are a rich source of words and synonyms, on which tools and applications running on corpora rely heavily. They are also the backbone of NLP-related work such as machine translation and search engines, and of developing and evaluating spell checkers.

The need for high-end databases in India's official languages is constantly felt, and C-DAC GIST has taken up the challenge of providing a unique and simple solution for a multilingual country like India by proposing tools that facilitate the generation of thesauri and dictionaries.

Gist Synonym Builder

This tool is designed for building a large database of synonyms for their respective headwords. The Gist Synonym Builder is a good way to digitize and store synonym data. The stored data is encoded in Unicode. Rarely used synonyms can be added for a headword, and grammatical information can also be preserved for it.

THESAURUS GENERATION TOOL

The Thesaurus Generation Tool is a good way to digitize thesaurus data and store it in XML file format. Various traditional thesauri of different languages were studied thoroughly to design it, and the structure of the thesaurus was finalized to suit different languages. Thesauri for Bengali and Telugu have been entered successfully using this tool.

Thesauri also exist for Hindi and Gujarati. The Telugu thesaurus entered using this tool has 56579 words, and the Bengali thesaurus has 76223 words. A Thesaurus GUI can be used for focused search in Search Engines. Other major areas of application include plug-ins for search engines, learning tools, e-governance, word nets, the semantic web and prediction dictionaries.
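As an illustration of storing thesaurus data in XML, the sketch below writes headwords and their synonyms to an XML string and reads them back. The element and attribute names (thesaurus, entry, headword, synonym) are assumptions made for this example; the schema actually used by the Thesaurus Generation Tool is not described here.

```python
import xml.etree.ElementTree as ET


def thesaurus_to_xml(entries):
    """Serialize {headword: [synonyms]} into an XML string (hypothetical schema)."""
    root = ET.Element("thesaurus")
    for head_word, synonyms in entries.items():
        entry = ET.SubElement(root, "entry", headword=head_word)
        for synonym in synonyms:
            ET.SubElement(entry, "synonym").text = synonym
    return ET.tostring(root, encoding="unicode")


def xml_to_thesaurus(xml_text):
    """Parse the XML produced above back into {headword: [synonyms]}."""
    root = ET.fromstring(xml_text)
    return {
        entry.get("headword"): [s.text for s in entry.findall("synonym")]
        for entry in root.findall("entry")
    }


# Illustrative Hindi sample: synonyms for "water"
entries = {"जल": ["पानी", "नीर"]}
xml_text = thesaurus_to_xml(entries)
restored = xml_to_thesaurus(xml_text)

print(xml_text)
print(restored == entries)
```

Using encoding="unicode" returns a Python string, so Indian-language text is kept as is rather than escaped.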

For more details, please contact:

More information on GIST products
E-Mail: info[dot]gist[at]cdac[dot]in

Sales related information
E-Mail: sales[dot]gist[at]cdac[dot]in

Support related information
E-Mail: support[dot]gist[at]cdac[dot]in