GIST contributions towards Standardization in Indian Language Computing - An Overview
Need for standards - Even today, basic hardware systems and software applications are designed and developed with only English in mind. To increase the acceptance and usage of Indian languages, the Indian language implementation has to sit on top of existing applications and hardware frameworks. GIST has a focus on all 22 official Indian languages. Of these, Assamese, Bengali, Bodo, Dogri, Hindi, Gujarati, Kannada, Konkani, Marathi, Malayalam, Maithili, Nepali, Oriya, Punjabi, Santhali, Tamil and Telugu are written left to right, while Urdu, Sindhi and Kashmiri are mostly written right to left. There are several overlaps: one language may use multiple scripts (eg: Konkani may be written in Devanagari, Kannada or Roman), and one script such as Devanagari may cater to multiple languages. For any application to reach the masses of India, it is important to support Information Technology in the various languages of India.
On the Web and Mobile platforms GIST has researched various aspects of the W3C recommendations and submitted the findings related to various languages including the right to left scripts. This activity is especially important in order to bridge the digital divide and proliferate the use of Indian languages on various modern media including television, handheld PDAs, Information access points, etc.
C-DAC GIST has participated in various standardization activities pertaining to language technology. It is also involved in standardization of heritage scripts of India.
Standardization
- W3C (Languages on Web)
- Internationalized Domain names
- E-Governance
- Linguistic formats, etc.
- Storage
- Input
- Display Fonts
W3C (World Wide Web Consortium)
Introduction
Under the aegis of DIT, C-DAC GIST has come up with a draft report on the representation of seven languages with respect to the various recommendations of the W3C. Of these, four belong to the Brahmi family and use Left To Right (LTR) mode to display the characters: Gujarati, Marathi, Konkani and Dogri. Sindhi, Kashmiri and Urdu, which are Perso-Arabic, use the Right To Left (RTL) mode of visual display. C-DAC GIST has extensively researched the various aspects related to Localization (l10n) and Internationalization (i18n).
The broad areas of research and recommendations include:
- Representation of Indian Languages on the World Wide Web.
- Character encoding issues for Languages such as Marathi, Konkani, Gujarati, Sindhi, Kashmiri, Dogri, Urdu, etc. in UNICODE
- Language Names - RFC-3066
- CSS and Text Formatting
- CLDR Common Locale Data Repository
- Mobile Web Initiative - Recommendations for correct representation of Indian languages on mobile and PDA devices
- Internationalization Tag Set - XML Tags, which help in localization like translate, ruby, direction, etc.
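The LTR/RTL distinction described above can be checked programmatically. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the helper name `base_direction` and the sample words are illustrative and not part of any GIST tool. It guesses the base direction of a string from its first strongly directional character, which is a simplification of the full Unicode bidirectional algorithm.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Return 'rtl' or 'ltr' based on the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):   # right-to-left letter, or Arabic letter
            return "rtl"
        if bidi == "L":           # left-to-right letter
            return "ltr"
    # Assumption: fall back to LTR when the text has no strong character.
    return "ltr"


# Illustrative sample words (language names written in their own scripts).
samples = {
    "Gujarati": "ગુજરાતી",
    "Marathi": "मराठी",
    "Urdu": "اردو",
    "Sindhi": "سنڌي",
}

for language, word in samples.items():
    print(language, base_direction(word))
```

A result of this kind could, for example, decide the value of a `dir` attribute when Indian language text is placed into a web page.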
Dynamic CSS Tester
Dynamic CSS Tester is a tool for comparing the effects of various CSS styles as they are applied to UTF-8 data. It lets you preview and compare different style sets side by side. This case study aims to investigate issues related to the rendering or display of Indian language content in UTF-8 and the effects of various CSS styles on it. To use it, enter the text you would like to preview, then modify the styles until you find a style set you want. If you see any problem with the applied CSS, take a screenshot of the problem and send it by mail using the Feedback link on the same page. If you feel that your data is not rendered correctly in the mail, click the GetSample button on the page, copy the code it generates in the text box next to the button, and paste that into the mail. The feedback that you send will be verified and consolidated by GIST and forwarded to the W3C for further action.
Internationalized Domain Names
In this age of Information Technology (IT), with the entire globe being integrated into a web-linked village and knowledge as the sole differentiator, the development of convivial access technology has gained prime importance. Especially for India, with its diverse and multilingual heritage and culture, the Internet is expected to play a dominant role in integrating almost all aspects of social and economic endeavour.
Introduction
GIST undertook research and study of various RFCs and their applicability to Indian languages under the guidance of the DIT.
The research focused on domain names in Indian languages such as Hindi, Gujarati and Urdu, and included the following:
- NamePrep: a StringPrep profile - RFC-3491
- PunyCode: Bootstring encoding - RFC-3492
- StringPrep - RFC-3454
- Internationalized Resource Identifiers (IRI) - RFC-3987, Path of IDN, etc.
GIST has submitted recommendations and reports on possible pitfalls, such as phishing (online fraud arising from similar-looking URLs), when implementing IDN in Indian languages.
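To illustrate how NamePrep and PunyCode fit together for an Indian language domain label, here is a minimal sketch using the `idna` and `punycode` codecs from Python's standard library. The helper names `to_ace` and `from_ace` and the sample label are illustrative assumptions. Note that the built-in `idna` codec follows the older IDNA 2003 rules, so it should be read as a demonstration of the mechanism rather than of current registry policy.

```python
def to_ace(label: str) -> str:
    """Convert one Unicode domain label to its ASCII-compatible 'xn--' form."""
    # The idna codec applies NamePrep and then Punycode-encodes the label.
    return label.encode("idna").decode("ascii")


def from_ace(label: str) -> str:
    """Convert an ASCII-compatible 'xn--' label back to Unicode."""
    return label.encode("ascii").decode("idna")


hindi_label = "भारत"  # illustrative Devanagari label

ace = to_ace(hindi_label)
print(ace)                           # ASCII form as it would appear in DNS
print(from_ace(ace) == hindi_label)  # round trip restores the original label

# The raw Bootstring (Punycode) output is the same string without the prefix.
raw = hindi_label.encode("punycode").decode("ascii")
print(ace == "xn--" + raw)
```

Because two visually similar labels produce entirely different ASCII forms, comparing the `xn--` forms is one simple way to notice look-alike domain names of the kind mentioned above.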
E-GOVERNANCE
GIST has contributed to recommendations related to the entire lifecycle of developing Indian language compliant e-governance applications.
These recommendations arise from C-DAC GIST's expertise in Indian languages and use of GIST tools and technologies in various large-scale, Indian language data-centric e-governance projects.
C-DAC GIST Tools have been used in several turnkey G2C (Government to Citizen) applications both at state and central level. GIST has also assisted several agencies in implementing various medium and large-scale projects.
GIST also participates in various forums for the standardization of the languages of India.
GIST is working towards Storage, Input and Display standards for Bodo, Santhali, Dogri, Maithili, etc., which have been added recently to the list of official languages.
Linguistic Formats and Heritage scripts
- Formats for dictionary creation tools and tagging tools have been recommended for various languages. These formats will streamline the creation of digital corpus dictionaries.
- C-DAC GIST is highly involved in standardization of all Information Technology related aspects of Heritage scripts such as Vedic, SamaVedic, Grantha, etc.
- Several C-DAC GIST Tools have been used in various Digital Library related projects.
Storage
- UNICODE (ISCII 88 based) - Today UNICODE is the most widely accepted and supported encoding for Indian languages. C-DAC GIST has contributed to the representation of Indian languages in storage standards such as UNICODE. Some recommendations, especially for the scripts of Urdu, Sindhi, Kashmiri, Dogri, Bodo, Santhali and Maithili, are at various stages of review and finalization. UNICODE is a character encoding system. The ranges starting at 0x600 (Perso-Arabic scripts of India) and 0x900 (Devanagari, used for Hindi, Marathi, etc.) represent the largest number of languages of India.
- The UNICODE consortium has come up with an evolving standard, currently at version 5. Changes for the representation of several Indian languages are still in progress. The need for normalisation (eg: multiple representations of characters with Nukta), the need for a collation sequence and sort order, ZWJ and ZWNJ issues, and issues related to Internationalized Domain Names (IDN) are being looked into by GIST R&D.
- As with most standards and recommendations, compliance is a major concern. The absence of a certifying authority, inadequate or faulty support in applications and in rendering and display engines, and slow or expensive updates are major bottlenecks to the proliferation of Indian languages.
- ISCII: Indian Script Code for Information Interchange, adopted by the Bureau of Indian Standards (BIS) as IS 13194:1991. The 8-bit flavour is the most commonly used, and it has minimal CPU and memory requirements. ISCII is a ‘character based encoding system’. The standard defines a common phonetic alphabetic set (character set).
- PASCII: PersoArabic Script Code for Information Interchange. For Urdu, Sindhi and Kashmiri - C-DAC GIST introduced standards for these languages when there were none available. Under the TDIL Initiative, various standards for storage, representation and entry were recommended. Several of these today find a place in the current industry standards such as UNICODE.
- ISCLAP: Standard for pager communication. In 1997, pager technology was considered mature enough for adopting Indian languages. Motorola took the lead and requested standardization of a coding scheme for Devanagari and Gujarati. The Telecom Engineering Center (TEC), C-DAC, DoE and the pager manufacturers agreed to the formulation of the "Indian Standard Code for Language Paging". This was done keeping compatibility with ISCII in mind, so that data inter-conversion at the sending end from terminals and at the receiving end would be based on a simple formula.
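The normalisation issue mentioned in the list above (multiple representations of characters with Nukta) can be seen directly in UNICODE data. The sketch below uses Python's standard `unicodedata` module; the helper name `same_text` is an illustrative assumption. It shows that the Devanagari letter QA can be stored either as one code point or as KA followed by the Nukta sign, and that comparing normalised forms treats the two as equal.

```python
import unicodedata

precomposed = "\u0958"        # DEVANAGARI LETTER QA as a single code point
decomposed = "\u0915\u093C"   # DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to normalisation form NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The two spellings look identical on screen but differ as stored data.
print(precomposed == decomposed)
print(same_text(precomposed, decomposed))

# Show the code points that each stored form contains.
for label, text in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label, [f"U+{ord(ch):04X}" for ch in text])
```

Searching, sorting and domain name comparison all depend on this step, since un-normalised data can make two visually identical words fail to match.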
Input
- INSCRIPT Keyboard Layout - INSCRIPT is a part of the ISCII standard and is supported by major OS vendors and applications. It is based on the phonetic nature of Indian scripts. The BIS ISCII document (IS 13194:1991) also describes the keyboard layout for each script. This traditional keyboard is widely accepted and supported by most of the Multinational Companies (MNC) who support Indian languages.
- The traditional INSCRIPT layout is very scientific. It places consonants on the right and vowels on the left. It also has a phonetic base, with higher consonants and higher vowels on the shift of the same key, thereby increasing typing speed.
- GIST has also developed Limited Keys Input mechanism, prediction algorithms and smart writing systems, which reduce the number of keys required for Indian languages.
Display Fonts
OPEN TYPE FONTS - For UNICODE support in various applications, GIST
Labs has developed Open Type Fonts for various scripts
including Urdu (Naskh as well as Nastaleeq/Nastaliq),
Sindhi and Kashmiri. Various modern OS today support
OT Fonts for viewing UNICODE data. Several GIST Tools
have also been upgraded to support the OT-Font technology.
ISFOC - Intelligence based Script FOnt Code:
The primary rule of thumb for typography is: if the text does not look good, we do not feel like reading it. Good typography is characterized by well-structured letterforms in a particular font, pleasant inter-letter spacing, ideal word spacing and healthy line spacing. Emphasis has been placed on text compositions (horizontal as well as vertical), final reproduction on output devices such as screens and printers, and aesthetic rendering and display for True Type Fonts.
- Bilingual font - Allows representation of English as well as the Indian language of choice. Supports the bare minimum features of any script. Ideal for developing applications with bilingual data, because it supports English as well as one Indian script.
- Monolingual font - Represents many more combinational characters than a bilingual font. Recommended if you are looking for a pure representation of the script.
- Bilingual-Web font / Monolingual-Web font - C-DAC GIST recommends the use of web-font types. The non-web fonts are supported only for backward compatibility. Using web-font types makes applications more robust and immune to problems related to the display of Indian languages. For some scripts like Tamil, GIST recommends using only these types of fonts.
- ISO-8859 compliant fonts for use with specialized applications have also been developed and deployed for various Indian languages. ISO fonts are used for Linux and some Windows applications (eg: Oracle D2K, 9iAS, Crystal Reports, .NET, etc.)
- Note: Typing is independent of Font Type in use.
Naming conventions for GIST True Type (TT) fonts
1. A. Mnemonics: Assamese (AS), Bengali (BN), Devanagari (DV - catering to Hindi, Marathi, etc.), Gujarati (GJ), Kannada (KN), Malayalam (ML), Manipuri (MN), Oriya (OR), Punjabi (PN), Tamil (TM), Telugu (TL)
or
1. B. Corresponding Bilingual: ASB, BNB, DVB, GJB, KNB, MLB, MNB, ORB, PNB, TMB, TLB
or
1. C. Corresponding Bilingual Web: ASBW, BNBW, DVBW, GJBW, KNBW, MLBW, MNBW, ORBW, PNBW, TMBW, TLBW
or
1. D. Corresponding Monolingual Web: ASW, BNW, DVW, GJW, KNW, MLW, MNW, ORW, PNW, TMW, TLW
2. Followed by a hyphen
3. TT - indicating True Type Font
4. Name of the font: Surekh, Yogesh, Mukta, Amar….
5. Numerals: EN for English numerals (optional). Tamil, Telugu and Malayalam support only English numerals.
- Tamil99 has the mnemonics TAB for bilingual and TAM for monolingual
Example : "GJBW-TTAvantikaEN" is
- Font for Gujarati Bilingual ISFOC data.
- It is a Web-Font
- It is a True-Type font
- It is identified or named as Avantika
- EN indicates that the font has English Numerals.
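The naming convention above is regular enough to be decoded mechanically. The following is a minimal sketch, not a GIST utility: the names `FontName` and `parse_font_name` and the regular expression are assumptions made for illustration, and the Tamil99 mnemonics (TAB, TAM) are not handled. A font whose own name ended in "EN" would be ambiguous under this sketch.

```python
import re
from typing import NamedTuple, Optional

# Script mnemonics taken from the naming convention above.
SCRIPTS = {
    "AS": "Assamese", "BN": "Bengali", "DV": "Devanagari", "GJ": "Gujarati",
    "KN": "Kannada", "ML": "Malayalam", "MN": "Manipuri", "OR": "Oriya",
    "PN": "Punjabi", "TM": "Tamil", "TL": "Telugu",
}


class FontName(NamedTuple):
    script: str
    bilingual: bool
    web: bool
    name: str
    english_numerals: bool


# Mnemonic (script, optional B, optional W), hyphen, TT, font name, optional EN.
_PATTERN = re.compile(
    r"(?P<script>" + "|".join(SCRIPTS) + r")(?P<bilingual>B?)(?P<web>W?)"
    r"-TT(?P<name>[A-Za-z]+?)(?P<numerals>EN)?"
)


def parse_font_name(font: str) -> Optional[FontName]:
    """Split a GIST-style TT font name into its parts, or return None."""
    match = _PATTERN.fullmatch(font)
    if match is None:
        return None
    return FontName(
        script=SCRIPTS[match["script"]],
        bilingual=bool(match["bilingual"]),
        web=bool(match["web"]),
        name=match["name"],
        english_numerals=match["numerals"] is not None,
    )


print(parse_font_name("GJBW-TTAvantikaEN"))
print(parse_font_name("DV-TTSurekh"))
```

For the example name above, the parsed fields correspond to the five bullet points: Gujarati, bilingual, web font, named Avantika, with English numerals.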
For more details, please contact:
More information on GIST products
E-Mail: info[dot]gist[at]cdac[dot]in
Sales related information
E-Mail: sales[dot]gist[at]cdac[dot]in
Support related information
E-Mail: support[dot]gist[at]cdac[dot]in