A Java Implementation of an Extended Word Alignment Algorithm Based on the IBM Models

Authors: Chinnappa Guggilla, Anil Kumar Singh
Conference: In Proceedings of the 3rd Indian International Conference on Artificial Intelligence. Pune, India. 2007.

Date: 2007-11-16
Report no: IIIT/TR/2007/85

Abstract

In recent years, statistical word alignment models have been widely used for various Natural Language Processing (NLP) problems. In this paper, we describe a platform-independent, object-oriented implementation (in Java) of a word alignment algorithm based on the first three IBM models. This is ongoing work in which we explore possible enhancements to the IBM models, especially for related languages such as the Indian languages. We have been able to improve performance by introducing a similarity measure (the Dice coefficient), a list of cognates, and a morphological analyzer. Information about cognates is especially relevant for Indian languages because these languages share many borrowed and inherited words. For our experiments on English-Hindi word alignment, we also tried using a bilingual dictionary to bootstrap the Expectation Maximization (EM) algorithm. After training on 7399 aligned sentence pairs, we compared the results with GIZA++, an existing word alignment tool. The results indicate that although the performance of our word aligner is lower than that of GIZA++, it can be improved by adding techniques such as smoothing to address the data sparsity problem. We are also working on further improvements using morphological information and a better similarity measure. The word alignment tool takes the form of an API and is being developed as part of Sanchay, a collection of tools and APIs for NLP with a focus on Indian languages.
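
To make the two main ingredients named in the abstract concrete, the following is a minimal Java sketch of a Dice coefficient over sentence-level co-occurrence counts and of EM training for IBM Model 1 only. It is not the implementation described in the paper and not the Sanchay API: the class name WordAlignmentSketch, its method names, and the romanized English-Hindi toy corpus are all hypothetical and invented for illustration. The sketch also assumes no NULL source word and no smoothing, and it does not show how the paper combines the Dice score, cognates, or a dictionary with EM, since the abstract does not say.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names are hypothetical, not the Sanchay API.
final class WordAlignmentSketch {

    // Map key for a (source word, target word) pair.
    static String key(String e, String f) {
        return e + "\t" + f;
    }

    // Dice coefficient 2*C(e,f) / (C(e) + C(f)), where each count is the number
    // of sentence pairs containing the word(s). Assumption: sentence-level counts.
    static double dice(List<String[]> src, List<String[]> tgt, String e, String f) {
        int countE = 0, countF = 0, both = 0;
        for (int k = 0; k < src.size(); k++) {
            boolean hasE = Arrays.asList(src.get(k)).contains(e);
            boolean hasF = Arrays.asList(tgt.get(k)).contains(f);
            if (hasE) countE++;
            if (hasF) countF++;
            if (hasE && hasF) both++;
        }
        return (countE + countF == 0) ? 0.0 : 2.0 * both / (countE + countF);
    }

    // IBM Model 1 trained with EM; returns t(f|e) keyed by key(e, f).
    // Simplification: no NULL source word and no smoothing.
    static Map<String, Double> trainModel1(List<String[]> src, List<String[]> tgt, int iterations) {
        Map<String, Double> t = new HashMap<>();
        for (int k = 0; k < src.size(); k++)
            for (String e : src.get(k))
                for (String f : tgt.get(k))
                    t.put(key(e, f), 1.0); // uniform start; the E-step normalizes
        for (int it = 0; it < iterations; it++) {
            Map<String, Double> count = new HashMap<>(); // expected pair counts
            Map<String, Double> total = new HashMap<>(); // expected counts per source word
            for (int k = 0; k < src.size(); k++) {
                for (String f : tgt.get(k)) {
                    double z = 0.0; // normalizer over source words in this sentence
                    for (String e : src.get(k)) z += t.get(key(e, f));
                    for (String e : src.get(k)) {
                        double c = t.get(key(e, f)) / z;
                        count.merge(key(e, f), c, Double::sum);
                        total.merge(e, c, Double::sum);
                    }
                }
            }
            // M-step: renormalize expected counts into probabilities t(f|e).
            for (Map.Entry<String, Double> entry : count.entrySet()) {
                String e = entry.getKey().substring(0, entry.getKey().indexOf('\t'));
                t.put(entry.getKey(), entry.getValue() / total.get(e));
            }
        }
        return t;
    }

    public static void main(String[] args) {
        // Invented toy corpus (romanized Hindi), not taken from the paper's data.
        List<String[]> src = Arrays.asList(
            new String[]{"red", "book"}, new String[]{"red", "house"}, new String[]{"big", "house"});
        List<String[]> tgt = Arrays.asList(
            new String[]{"lal", "kitab"}, new String[]{"lal", "ghar"}, new String[]{"bada", "ghar"});
        Map<String, Double> t = trainModel1(src, tgt, 10);
        System.out.println("dice(book, kitab) = " + dice(src, tgt, "book", "kitab"));
        System.out.println("t(kitab|book) = " + t.get(key("book", "kitab")));
        System.out.println("t(lal|book)   = " + t.get(key("book", "lal")));
    }
}
```

On this toy corpus, EM gradually shifts probability mass for "book" from "lal" to "kitab", because "lal" is better explained by "red", which occurs in two sentence pairs. With only a few sentence pairs the estimates are unreliable, which is the data sparsity problem that the abstract proposes to address with smoothing.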

Full paper: pdf

Centre: Language Technologies Research Centre

IIIT Hyderabad Publications
