IIIT Hyderabad Publications
Robust Representation Learning for Low Resource Languages

Author: Syed S. Akhtar
Date: 2018-03-10
Report no: IIIT/TH/2018/11
Advisor: Manish Shrivastava

Abstract

Understanding the meaning of words is essential for most natural language processing tasks. Word representations are a means to mathematically represent the meaning of a word in a way that computers can understand. These representations often take the form of vectors in a continuous vector space of fixed dimensionality, also referred to as word embeddings. In this thesis, we focus on generating better, more reliable word representations for low resource languages. Many languages, though widely spoken, are largely under-represented in this area of research. One of the main reasons for this is the lack of reliable evaluation metrics for comparing different approaches to building these embeddings. The word similarity task is a widely used, computationally efficient method to directly evaluate the quality of word vectors. It relies on computing the correlation between human-assigned similarity scores for word pairs and the similarities between the corresponding word vectors. We release word similarity datasets for six low resource languages: Urdu, Telugu, Marathi, Punjabi, Tamil and Gujarati. To construct these datasets, our approach relies on translation and re-annotation of English word similarity datasets. We also present baseline scores for word representation models built with state-of-the-art techniques for Urdu, Telugu and Marathi by evaluating them on the newly created word similarity datasets. For linguistically similar languages, we show that it is possible to use the better-trained word representations of the more resourceful language for the other language in the pair, using a projection learning approach which relies on a mapping between words having similar meaning in the two languages.
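The word similarity evaluation described above can be sketched as follows: score each dataset pair by the cosine similarity of its word vectors, then report the Spearman rank correlation against the human-assigned scores. This is a minimal illustration, not the thesis's evaluation code; the function names (`evaluate_word_similarity` and helpers) are hypothetical, and the rank computation ignores ties for brevity.

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense word vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    # Spearman rank correlation (simplified: no tie correction).
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank + 1)
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def evaluate_word_similarity(pairs, human_scores, embeddings):
    # Correlate human judgements with model similarities
    # over a list of (word1, word2) pairs.
    model_scores = [cosine(embeddings[w1], embeddings[w2])
                    for w1, w2 in pairs]
    return spearman(human_scores, model_scores)
```

A higher correlation means the embedding space orders word pairs more like human annotators do, which is how the baseline scores for Urdu, Telugu and Marathi would be compared across models.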
This cross-lingual vector space transformation yields state-of-the-art results on the French and German word similarity test sets: an increase of 13% for French and 19% for German, using English as the source language. We further demonstrate that this approach is better suited to linguistically similar language pairs like Hindi-Urdu (where 60% of words are simply transliterations of each other) than to English-German or English-French. Finally, we show how a similar technique can be used to model prefix- and suffix-based morphology.

Full thesis: pdf

Centre for Language Technologies Research Centre
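The projection learning idea can be illustrated with a simple linear-map sketch: given a seed dictionary of translation pairs, fit a matrix W that maps source-language vectors onto their target-language counterparts by least squares, then apply W to any source vector. This is an assumption-laden illustration of the general technique (in the spirit of linear cross-lingual mappings), not the thesis's exact method; the function names `learn_projection` and `project` are hypothetical.

```python
import numpy as np

def learn_projection(src_vecs, tgt_vecs):
    # Fit W minimizing ||X W - Y||^2, where each row of X is the
    # source-language vector of a dictionary entry and the matching
    # row of Y is its target-language translation's vector.
    X = np.asarray(src_vecs, dtype=float)
    Y = np.asarray(tgt_vecs, dtype=float)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def project(vec, W):
    # Map a single source-language vector into the target space.
    return np.asarray(vec, dtype=float) @ W
```

Once W is learned from the seed dictionary, every word in the lower-resource language can borrow the geometry of the better-trained space, which is what makes the transfer especially effective for closely related pairs such as Hindi-Urdu.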
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.