IIIT Hyderabad Publications
Title: Towards developing tools for Indian Languages using Deep Learning
Author: Pravallika Etoori
Date: 2019-10-03
Report no: IIIT/TH/2019/110
Advisor: Radhika Mamidi

Abstract

Extensive research is being carried out in the field of Natural Language Processing (NLP) for Indian languages. Spelling correction, word segmentation, and grammar checking are fundamental problems in NLP; each aims to identify noise in the data and correct it. These tools are important for many NLP applications such as web search engines, text summarization, sentiment analysis, and machine translation. Many methods have been developed for these problems in English, but they typically exploit linguistic resources such as parsers and large amounts of real-world data, making them difficult to adapt to other languages. Deep learning models have also been implemented for English; these models use parallel data of noisy-to-correct mappings from different sources as training data for automatic correction tasks. Indian languages are resource-scarce and lack such parallel data, owing to the low volume of queries and the absence of prior implementations. Spelling correction is crucial for applications like web search engines. In speech recognition and optical character recognition (OCR), word boundaries are often not discernible. Grammar checking is important for cleaning a corpus before using it in any application; it is also useful as an aid for second-language learners and for applications such as automatic essay grading. In applications such as speech-to-text, handwritten-text-to-text, and machine translation, applying spell checking, word segmentation, and grammar checking to the output improves accuracy. Most existing language processing systems for Indian languages are rule-based and language-specific; the majority take a dictionary-based approach, checking errors against a dictionary.
Due to the lack of training data for machine learning models, we create synthetic datasets using language rules. In this work, we present approaches to building automatic language correction systems for Hindi. We propose deep learning models for spelling correction, word segmentation, and grammar checking. The proposed approaches are applicable to any resource-scarce language. A comparative evaluation for each of the three problems shows that the proposed models are competitive with existing correction techniques for Indian languages.

Full thesis: pdf

Centre for Language Technologies Research Centre
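The abstract states that, lacking real parallel data, the thesis creates synthetic noisy/correct pairs using language rules. A minimal sketch of this idea is shown below, assuming simple character-level error rules (deletion, transposition, duplication); the specific error types, rates, and function names here are illustrative assumptions, not the thesis's actual rules.

```python
import random

def add_noise(word: str, rng: random.Random) -> str:
    """Apply one random character-level edit to a word.

    The three edit operations below are illustrative assumptions,
    not the exact language rules used in the thesis.
    """
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["delete", "swap", "duplicate"])
    if op == "delete":
        return word[:i] + word[i + 1:]          # drop one character
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]  # transpose
    return word[:i] + word[i] + word[i:]        # duplicate one character

def make_parallel_corpus(clean_words, seed=0):
    """Return (noisy, correct) pairs usable as training data."""
    rng = random.Random(seed)
    return [(add_noise(w, rng), w) for w in clean_words]

# Hypothetical clean Hindi vocabulary for demonstration.
pairs = make_parallel_corpus(["नमस्ते", "भारत", "हिंदी"])
```

Each resulting pair maps a synthetically corrupted form back to its clean original, giving the noisy-to-correct mappings that a sequence model can be trained on when no naturally occurring parallel data exists.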
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.