Imla Shanaas: The Urdu spell-checker
Introduction
As the name suggests, Imla Shanaas is a spell-checker for modern Urdu as used in both India and Pakistan. Its features incorporate the latest in both technology and language:
- The dictionary comprises over 70,000 root words which, when expanded, allow around 700,000 Urdu words to be spell-checked.
- The words in the dictionary follow the latest spelling norms, ensuring full compliance with Urdu Imla.
- The dictionary is a judicious mix of vocabulary culled from lexical databases and from corpora covering daily news, philosophy, poetry, literature, advertisements, general knowledge, current affairs, basic science, mathematical terms and encyclopedia vocabulary, to provide the widest possible spell-checking coverage.
- Suggestions are the heart of a spell-checker. Drawing on suggestion heuristics and on the most common errors made by Urdu speakers, Imla Shanaas normally places the correct word within the top three suggestions. An intelligent word-splitting algorithm ensures that compounding is handled safely. Airabs are also accounted for, and the spell-checker can handle every diacritic mark used in modern Urdu.
- Imla Shanaas can handle Unicode, UTF-8 as well as PASCII (the proprietary standard of C-DAC GIST).
- A floating keyboard allows the user to correct text within the text-box itself.
- The file can be saved in multiple formats, such as PASCII and Unicode, to suit the user's requirements.
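
The suggestion, word-splitting and diacritic features listed above can be illustrated with a small sketch. It is not Imla Shanaas's actual implementation, which is not described here. It assumes that suggestions are ranked by plain edit distance, that a compound is split into exactly two dictionary words, and that Airabs are the marks U+064B to U+0652 plus U+0670. The function names and the four-word lexicon are hypothetical.

```python
# Assumed set of Airabs: U+064B..U+0652 plus superscript alef U+0670.
AERAB = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def strip_aerab(text):
    """Drop diacritic marks so marked and unmarked spellings compare equal."""
    return "".join(ch for ch in text if ch not in AERAB)


def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def is_known(word, lexicon):
    """A word is accepted if its unmarked form is in the lexicon."""
    return strip_aerab(word) in lexicon


def split_compound(word, lexicon):
    """Return (left, right) if the word is two lexicon entries run together."""
    bare = strip_aerab(word)
    for i in range(1, len(bare)):
        if bare[:i] in lexicon and bare[i:] in lexicon:
            return bare[:i], bare[i:]
    return None


def suggest(word, lexicon, limit=3):
    """Return the `limit` lexicon entries closest to the word."""
    bare = strip_aerab(word)
    ranked = sorted(lexicon, key=lambda e: (edit_distance(bare, e), e))
    return ranked[:limit]


if __name__ == "__main__":
    # Hypothetical toy lexicon: kitab (book), khana (house),
    # qalam (pen), ilm (knowledge).
    toy = {"کتاب", "خانہ", "قلم", "علم"}
    # kitab written with a kasra (U+0650) on the first letter
    print(is_known("\u06A9\u0650\u062A\u0627\u0628", toy))
    print(split_compound("کتابخانہ", toy))
    print(suggest("کتب", toy))
```

A production spell-checker would presumably weight the ranking by the common error patterns mentioned above rather than treating every edit as equally likely; the uniform cost used here is only the simplest choice.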