Term Translation Disambiguation in Cross-Language Information Retrieval (Case Study: Translation from Arabic to English)

ABSTRACT

Cross-language information retrieval (CLIR), where queries and documents are in

different languages, has become one of the major topics within the information retrieval

community. A key step in CLIR is translation. This research proposes a

term translation disambiguation method based on co-occurrence statistics for

translation in Arabic-English CLIR.

There are multiple ways to perform query translations: employing machine

translation techniques, using parallel corpora or using bilingual dictionaries. The

first two approaches are very labour-intensive. Hand-coding of linguistic,

semantic and pragmatic knowledge is required for a machine translation engine to

produce good translations. This can be quite overwhelming when the domain of

coverage is wide. A great deal of work is also required for building parallel

collections when using the second approach. With the increasing availability of

machine-readable bilingual dictionaries, the third approach has become a viable

option for CLIR. With this approach, however,

resolving term ambiguity is a crucial step.

In this research, the ambiguity problem was resolved using co-occurrence statistics. The co-occurrence

technique is based on the hypothesis that correct translations tend to co-occur

in the target-language collection. Therefore, the valid translation

among the set of possible candidate translations of a given source query term is

expected to have a high frequency of co-occurrence with the translations of the other

terms in the same source query.
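
The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method as implemented in this research: the function name select_translations, the rule of summing each candidate's best association with every other query term, and the scores in TOY_SCORES are all hypothetical, and the numbers are invented for the example. The two-term query is meant to mirror an undiacritized Arabic homograph such as ذهب, which a dictionary may render as "gold" or "went".

```python
def select_translations(candidates, assoc):
    """Pick one translation per source term by co-occurrence support.

    candidates: list of lists; candidates[i] holds the dictionary
                translations of the i-th source query term.
    assoc:      function (word_a, word_b) -> float, a symmetric association
                score measured on the target-language collection.
    """
    chosen = []
    for i, options in enumerate(candidates):
        def support(word):
            # For every other query term, keep the score of the candidate
            # that is most strongly associated with `word`, then sum.
            return sum(
                max(assoc(word, other) for other in candidates[j])
                for j in range(len(candidates))
                if j != i
            )
        chosen.append(max(options, key=support))
    return chosen


# Invented association scores, for illustration only (not measured data).
TOY_SCORES = {
    frozenset(("price", "gold")): 2.5,
    frozenset(("rate", "gold")): 1.0,
    frozenset(("price", "went")): 0.1,
    frozenset(("rate", "went")): 0.2,
}


def toy_assoc(a, b):
    """Look up a toy score; unseen pairs count as zero association."""
    return TOY_SCORES.get(frozenset((a, b)), 0.0)


# Hypothetical dictionary output for a two-term query: the first term may
# mean "price" or "rate", the second (a homograph) "gold" or "went".
query_candidates = [["price", "rate"], ["gold", "went"]]
print(select_translations(query_candidates, toy_assoc))
```

With these toy scores the sketch keeps "price" and "gold", because that pair supports each other far more strongly than any combination involving "went".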

After the document set is divided into fixed-size windows to overcome the problem of varying

document length, the degree of association is calculated using the mutual

information measure, because it is simple and produces high correlation between terms

even when they do not appear very frequently in the document set.
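
A minimal sketch of the windowing and association steps follows. The abstract does not give the exact formula or window size, so the sketch assumes non-overlapping windows and the common pointwise form of mutual information, MI(a, b) = log2(P(a, b) / (P(a) P(b))), with probabilities estimated as the fraction of windows containing the term or the pair. The function names and the sample tokens are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def split_into_windows(tokens, size):
    """Cut one tokenised document into consecutive, non-overlapping windows."""
    return [tokens[k:k + size] for k in range(0, len(tokens), size)]


def build_mutual_information(documents, window_size):
    """Return a function mi(a, b) estimated from window-level co-occurrence."""
    windows = []
    for tokens in documents:
        windows.extend(split_into_windows(tokens, window_size))
    n = len(windows)

    term_count = Counter()  # number of windows containing each term
    pair_count = Counter()  # number of windows containing each term pair
    for window in windows:
        vocab = sorted(set(window))
        term_count.update(vocab)
        pair_count.update(combinations(vocab, 2))

    def mi(a, b):
        joint = pair_count[tuple(sorted((a, b)))]
        if joint == 0:
            # The terms never share a window: no evidence of association.
            return float("-inf")
        # log2( (joint / n) / ((count_a / n) * (count_b / n)) )
        return math.log2(joint * n / (term_count[a] * term_count[b]))

    return mi


# Invented sample text, for illustration only.
docs = [["gold", "price", "rises", "today", "gold", "price", "falls", "again"]]
mi = build_mutual_information(docs, window_size=4)
print(mi("gold", "price"), mi("rises", "today"), mi("rises", "falls"))
```

In this toy collection the pair "rises" and "today" occurs only once yet scores higher than the frequent pair "gold" and "price", which illustrates why the measure can yield a high association for terms that are not very frequent.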

The results of the developed method show that co-occurrence statistics can reduce the ambiguity problem and that the method works well in cases involving diacritics and homonyms.