Generation of Bilingual Dictionaries using Comparable and Quasi Comparable Corpora

Author: Ajay Dubey
Date: 2016-07-28
Report no: IIIT/TH/2016/45
Advisor: Vasudeva Varma

Abstract

The amount of information available on the web is increasing rapidly, and so is the number of internet users. A significant share of these users are monolingual: they want to express themselves and seek information in their native language. As a result, multilingual content on the internet is also growing at a rapid pace. There is a need for systems that let such users express themselves in the language of their choice and that answer their information requests irrespective of the language barrier. Cross-lingual term associations are very important for many interlingual applications. Machine Translation (MT) is a sub-field of computational linguistics that aims at translating text from one language to another, and all MT systems depend, at their core, on bilingual dictionaries. A bilingual dictionary, whose entries are word translations, is an important resource for NLP applications such as statistical machine translation and cross-language information extraction. Such dictionaries can also be used to enhance existing dictionaries and for second-language teaching and learning. Manually created resources are usually more accurate and, in contrast to automatically learned dictionaries, do not contain noisy information. The scientific community seeks methods that achieve a similar level of accuracy and a broader terminology scope by automatic means. In this thesis, we address the problem of generating bilingual dictionaries automatically. We propose two different approaches to generating bilingual dictionaries for the English-Hindi pair. Both approaches are language independent and hence can be used to build dictionaries for other language pairs which are phonetically rich. Hindi is very under-represented on the web for many technical and socio-cultural reasons.
Many languages, like Hindi, suffer from the limited availability of language resources and tools, so it is crucial to develop language-independent approaches for them. Although we report experiments for the English-Hindi language pair, our approach can easily be extended to other language pairs, since we did not use any language-specific resource. We chose two of the most reliable document sources for the English-Hindi pair, viz. Wikipedia and news streams. We use Wikipedia as a comparable corpus and the ever-growing news corpus as a quasi-comparable corpus. In the first approach, we exploit the structural properties of documents to build a bilingual English-Hindi dictionary. The main intuition behind this approach is that documents in different languages discussing the same topic are likely to have similar structural elements. We demonstrate this by applying our approach to comparable corpora whose documents have structural properties. Documents are inherently divided into sections and subsections. We link each section to its counterpart in the other language using a statistical approach. These mini-sections are further divided into comparable sentences. After applying some basic NLP techniques, we choose the most frequently co-occurring word pairs as candidate dictionary entries. One of the major contributions of this approach is that the dictionary contains translations and transliterations of words, which include Named Entities to a large extent. The approach is language independent. In the second automatic method, we exploit continuous quasi-comparable corpora to derive term-level associations that enrich the dictionaries generated by the first approach. Both approaches use easily accessible (i.e., comparable and quasi-comparable) data sources to derive dictionary entries. News is one such continuously generated quasi-comparable corpus.
Using this approach on a news dataset, new vocabulary can be added to the dictionary. According to the same taxonomy, we try to link news stories across languages that share the same focal event, so that the linked stories can serve as comparable corpora. Afterwards, we apply the dictionary generation algorithm to extract a bilingual dictionary. Our method is completely automatic and does not rely on any labelled information. We compare the dictionary generated by this method with the dictionary generated by the first approach. We show that this approach is able to derive interesting term-level associations across languages. The dictionary generated from Wikipedia is of high quality because of its highly structured information, such as titles, sections, and sub-sections, and its interlingual links. We demonstrate that the news-based method is able to find some associations that are not present in Wikipedia.
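
The co-occurrence step of the first approach can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes that comparable sentence pairs have already been linked, tokenises on whitespace, and ranks word pairs with the Dice coefficient, which is a choice made here for illustration (the abstract only says that the most co-occurring pairs are chosen). The function name `cooccurrence_candidates`, the `min_count` threshold, and the toy sentences are all hypothetical.

```python
from collections import Counter
from itertools import product


def cooccurrence_candidates(sentence_pairs, min_count=2):
    """Return {source_word: (target_word, dice_score)} from linked sentence pairs.

    Assumption: each pair holds two sentences believed to be comparable.
    The Dice coefficient is an illustrative association score, not
    necessarily the measure used in the thesis.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in sentence_pairs:
        # Count each word at most once per sentence.
        src_words = set(src_sentence.lower().split())
        tgt_words = set(tgt_sentence.split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        # Every source word co-occurs with every target word of the pair.
        pair_freq.update(product(src_words, tgt_words))

    best = {}
    for (src, tgt), count in pair_freq.items():
        if count < min_count:
            continue  # ignore pairs seen together too rarely
        dice = 2 * count / (src_freq[src] + tgt_freq[tgt])
        if src not in best or dice > best[src][1]:
            best[src] = (tgt, dice)
    return best


# Toy English-Hindi data, invented for this example.
pairs = [
    ("delhi is the capital", "दिल्ली राजधानी है"),
    ("delhi is large", "दिल्ली बड़ी है"),
    ("mumbai is large", "मुंबई बड़ी है"),
]
candidates = cooccurrence_candidates(pairs)
for word, (translation, score) in sorted(candidates.items()):
    print(word, translation, round(score, 2))
```

On this toy data, words that appear in only one sentence pair yield no candidate, because none of their pairings reaches `min_count`. A real corpus would also need the basic NLP preprocessing the abstract mentions, which this sketch leaves out.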

Full thesis: pdf

Centre for Search and Information Extraction Lab

IIIT Hyderabad Publications
