Search Engine Technologies

 
 

Search Technologies

The Web is constantly evolving, from the basic Web to WWW2 and today to WWW3, which is mainly focused on man-machine interaction. Fast, accurate and user-friendly information retrieval is the watchword of WWW3. Here, Natural Language Interfaces play a crucial role. The GIST Labs are working in two focal areas: Indian Language Plug-Ins for Search Engines and the Semantic Web.

Search Engine PLUG-INS
Search engines, even some of the best today, are mainly statistical in nature: their heuristics are based on statistical prediction, and their ranking algorithms often fail to satisfy the user.

However useful they are, there are serious problems associated with their use:

  • Information overkill without precision. Too much content is as bad as too little. If the user has to go through 12,000 pages to get relevant information, (s)he has neither the time nor the energy to do so.
  • Low or zero information. The converse is equally possible, although rarer.
  • Sensitivity to wording. A change in wording returns different pages, as very often does a change in spelling or the use of spelling variants.
  • Monolithic results. If information is needed about various pieces of data, separate queries have to be initiated and their results combined to meet the requirement. If the user wants to book a ticket on a train or a plane, only multiple queries will meet the requirement.

In other words, Web content outstrips Web retrieval technology. Search engines are therefore at best only "addresses" on the Information highway, and to call them "Information Retrieval Tools" is a misnomer.

In the case of Indian languages the problem is even more acute and can be spelled out as: lack of correspondence between script and language, intra-word grammars, multilingualism, the complex inflectional or agglutinating nature of Indian languages, legacy data, spelling variants, incorrect spellings, and natural queries.

Thanks to sophisticated linguistic tools, the Indian language plug-ins allow such searches to be carried out.

Semantic Web
The aim of the Semantic Web is to allow much more advanced knowledge management systems such that:

  • Knowledge will be organized in conceptual spaces according to its meaning.
  • Automated tools will support maintenance by checking for inconsistencies and extracting new knowledge.
  • Keyword-based search will be replaced by query answering: requested knowledge will be retrieved, extracted, and presented in a human friendly way.
  • Query answering over several documents will be supported.
  • Defining who may view certain parts of information (even parts of documents) will be possible.

The search engine of tomorrow will be Semantic Web-compliant, and GIST Labs are already working in this area to develop a Semantic Web for Indian languages, which will address all major issues pertinent to Indian scripts, such as tomography, ontology, and the creation of agents, within the framework of true information retrieval.

Music Search
What is Music Search :

Music Search is a web-based application that simulates a search engine in the music domain.

Searching is classified into four categories, viz. Actors, Movie Title, Singer, Song Title.

The search is based on an exact match of the entered word and also on similar-sounding words. The application is powered by GIST's proprietary Homophone engine, which broadens the search by returning not only exact matches of the word but also words with a similar spelling or a similar pronunciation.

It runs in all browsers that support Ajax.

How is it different from others :
When it comes to Indian languages, there can be a lot of variation in the spelling of a given word when it is written in English. It is not easy for the user to type the correct spelling of some complex Hindi words while searching for them. This search application therefore ensures that the search output includes all words with the exact spelling and also words with a spelling similar to the one entered by the user. For example, if the user enters the word 'Abhijjet', the search returns results for the words Abhijeet and Abhijit.

Similarly, if the user enters the word 'Bombay', the search output includes Bambai, Bombai and Mumbai as well. This feature is exclusive to this application.
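The matching described above can be illustrated with a minimal sketch. The GIST Homophone engine is proprietary and its algorithm is not described here, so the code below is not that engine: it swaps in a simple, hypothetical phonetic key (collapse repeated letters, then drop vowels after the first letter) purely to show how similar-sounding spellings can be grouped. The names `phonetic_key`, `build_index` and `search` are assumptions made for this example.

```python
import re
from collections import defaultdict


def phonetic_key(word: str) -> str:
    """Hypothetical phonetic key: NOT the GIST Homophone engine."""
    w = re.sub(r"[^a-z]", "", word.lower())
    # Collapse runs of the same letter: "jj" -> "j", "ee" -> "e".
    w = re.sub(r"(.)\1+", r"\1", w)
    if not w:
        return ""
    # Keep the first letter, drop vowels (and y) from the rest.
    return w[0] + re.sub(r"[aeiouy]", "", w[1:])


def build_index(names):
    """Group catalogue entries by their phonetic key."""
    index = defaultdict(list)
    for name in names:
        index[phonetic_key(name)].append(name)
    return index


def search(index, query):
    """Return every catalogue entry that shares the query's key."""
    return index.get(phonetic_key(query), [])


catalogue = ["Abhijeet", "Abhijit", "Bambai", "Bombai", "Mumbai", "Asha"]
index = build_index(catalogue)
print(search(index, "Abhijjet"))  # ['Abhijeet', 'Abhijit']
print(search(index, "Bombay"))    # ['Bambai', 'Bombai']
```

With this particular key, 'Bombay' does not reach Mumbai, because the two names do not sound alike; returning Mumbai as well would presumably need a separate table of alternate names on top of the sound-based match.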

Usability :
This application is of great use in the Music Domain wherein a user has to search for Artist name, Movie name, Song title or Singer name.

It can be used as a plug-in component for various music sites, making a site more user-friendly by enabling this kind of intensive search on its available database.

Features :

  1. Easy to use web application
  2. Works with IE, Mozilla, Opera, Safari.
  3. Easy to alter the contents.
  4. Search output includes similar-sounding words, thus providing every possible related entry and ensuring efficient searching.

Tools and Technologies used :
  1. Homophone Engine: The search is powered by GIST's proprietary Homophone engine.
  2. Servlet: It functions as the server side core component that services the client requests.
  3. Ajax: To display the output on the web page.

Problems with existing UNICODE based engines

Problem 1
A 'UNICODE only' search engine is not sufficient
UNICODE-based searches are supported by engines such as Yahoo and Google, which confirms the need for a search engine for Indian languages in the first place. Yet these engines are inadequate when catering to Indian languages.

1.1 Multiple encodings of data: Websites in Indian languages (including E-Gov applications, which form a major chunk) may use a variety of proprietary hack font encodings from different vendors, which may or may not be suitable for the web, or they may be in UNICODE / UTF-8. Identifying the encoding of a page and the script of its content is therefore difficult.

1.2 In UNICODE, Indian language data requires normalisation (e.g. ja + nukta in reserve). Most applications do not cater to this.

1.3 A large user base in India uses Windows 9x systems to access the net over low bandwidth.
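The normalisation issue in 1.2 can be seen with the letter za, which UNICODE allows to be stored either as a single code point or as ja followed by a nukta. The sketch below uses Python's standard `unicodedata` module to bring both forms to one representation before comparison; the helper name `normalise` is an assumption for this example, and a real engine would likely need further script-specific rules beyond standard normalisation.

```python
import unicodedata


def normalise(text: str) -> str:
    """Bring canonically equivalent sequences to one form (NFC)."""
    return unicodedata.normalize("NFC", text)


precomposed = "\u095b"     # DEVANAGARI LETTER ZA as one code point
sequence = "\u091c\u093c"  # DEVANAGARI LETTER JA + SIGN NUKTA

# The raw strings differ, so a byte-for-byte search would miss one form.
print(precomposed == sequence)                        # False
# After normalisation both spell the same code point sequence.
print(normalise(precomposed) == normalise(sequence))  # True
```

Applying the same normalisation to both the indexed text and the query is what makes the two stored forms match.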

Problem 2
Several words have multiple correct spellings and alternate representation forms, e.g. the word Hindi may be written with a bindi on top of the first syllable or with a half na. The same holds for the representations of the word vitthal.

What should happen when such words are searched for in search engines?
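One way a search engine could treat the two spellings of Hindi as equivalent is to reduce both to a common search key. The following is a minimal sketch under stated assumptions: it folds a half nasal consonant (nasal + virama) that precedes another consonant into the anusvara (the bindi). The function name `fold_nasals` is hypothetical, the rule is deliberately crude, and the result is meant only as an internal key for matching, not as a correct spelling to display.

```python
import re

ANUSVARA = "\u0902"  # the bindi written above the syllable
VIRAMA = "\u094d"    # halant, which turns a consonant into its half form
NASALS = "\u0919\u091e\u0923\u0928\u092e"  # nga, nya, nna, na, ma
CONSONANTS = "\u0915-\u0939"               # ka .. ha

# A nasal consonant in half form, directly followed by another consonant.
_HALF_NASAL = re.compile(f"[{NASALS}]{VIRAMA}(?=[{CONSONANTS}])")


def fold_nasals(word: str) -> str:
    """Return a search key in which half nasals become anusvara."""
    return _HALF_NASAL.sub(ANUSVARA, word)


with_half_na = "\u0939\u093f\u0928\u094d\u0926\u0940"  # Hindi with half na
with_bindi = "\u0939\u093f\u0902\u0926\u0940"          # Hindi with bindi

print(fold_nasals(with_half_na) == fold_nasals(with_bindi))  # True
```

Both the index and the query would have to pass through the same folding for the match to work.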


Problem 3
Owing to historical reasons, Indian languages are rich in synonyms, with the same word often having several. These are used interchangeably in web content.
Synonyms
- Similar meaning (e.g. Bharat, India)
- Colloquial terms
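Synonym handling of this kind is commonly done by expanding the query before it is run. The sketch below assumes a small hand-built table of synonym groups; the table contents beyond the Bharat / India pair from the text, and the names `SYNONYM_GROUPS` and `expand_query`, are assumptions for illustration only.

```python
# Hypothetical synonym table: each set holds terms treated as equivalent.
SYNONYM_GROUPS = [
    {"bharat", "india"},
]


def expand_query(term, groups=SYNONYM_GROUPS):
    """Return the term plus every synonym listed alongside it."""
    key = term.lower()
    expanded = {key}
    for group in groups:
        if key in group:
            expanded |= group
    return sorted(expanded)


print(expand_query("Bharat"))  # ['bharat', 'india']
print(expand_query("Pune"))    # ['pune']
```

The engine would then search for any of the returned terms, so that a page using either word is found.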


Problem 4
Many languages, one script.
Given a page with Devanagari encoding:
- Devanagari supports 54 languages / dialects, of which the main ones are Hindi, Marathi, Konkani and Nepali. The search engine must be able to identify the language correctly to avoid giving wrong results.
One language, many scripts. How should a word be searched for in Konkani?
Konkani: Devanagari, Roman, Kannada, Malayalam
Sindhi: Devanagari, Gujarati, Roman, Perso-Arabic
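The first step in handling either case is to detect which script a piece of text is written in. The sketch below does this by checking each character against the UNICODE block ranges of the scripts named above; the function names are assumptions for this example. It identifies only the script: telling Hindi from Marathi or Konkani within Devanagari needs further linguistic evidence that this sketch does not attempt.

```python
from collections import Counter

# UNICODE block ranges for the scripts mentioned in the text.
SCRIPT_RANGES = {
    "Perso-Arabic": (0x0600, 0x06FF),
    "Devanagari": (0x0900, 0x097F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def script_of(ch):
    """Return the script name for one character, or None if unknown."""
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            return name
    if ch.isascii() and ch.isalpha():
        return "Roman"
    return None


def dominant_script(text):
    """Return the script used by most characters in the text."""
    counts = Counter(s for s in map(script_of, text) if s)
    return counts.most_common(1)[0][0] if counts else None


print(dominant_script("\u0915\u094b\u0902\u0915\u0923\u0940"))  # Devanagari
print(dominant_script("Konkani"))                               # Roman
```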

Problem 5
Typographical variants.

Most search engines offer a 'did you mean' option based on statistics. On the web, however, it has been observed that in Indian languages incorrect spellings may be used more often than correct ones, so a suggestion based purely on frequency can be misleading.


Problem 6
Language variants.

Due to the complex nature of Indian languages, a user should be given results that include the linguistic variants of the searched terms, including their suffixed forms.
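A simple way to match suffixed forms is to strip known suffixes from both the indexed words and the query, so that they meet at a common stem. The sketch below is a rough illustration with a tiny, hypothetical list of Hindi plural endings; a real system would need a much fuller morphological analysis, and the names `SUFFIXES` and `stem` are assumptions for this example.

```python
# A few Hindi plural endings, for illustration only (not exhaustive).
SUFFIXES = [
    "\u093f\u092f\u094b\u0902",  # -iyon
    "\u094b\u0902",              # -on
    "\u0947\u0902",              # -en
]


def stem(word, suffixes=SUFFIXES, min_stem=2):
    """Strip the longest matching suffix, keeping at least min_stem chars."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


kitab = "\u0915\u093f\u0924\u093e\u092c"  # kitab (book)
kitaben = kitab + "\u0947\u0902"          # kitaben (books)
kitabon = kitab + "\u094b\u0902"          # kitabon (books, oblique form)

print(stem(kitaben) == stem(kitabon) == stem(kitab))  # True
```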


For more details, please contact:

More information on GIST products
E-Mail:
info.gist@cdac.in

Sales related information
E-Mail:
sales.gist@cdac.in

Support related information
E-Mail:
support.gist@cdac.in

 
