IIIT Hyderabad Publications
Building Telugu Corpora for NLP Applications: Paraphrasing, Question Answering, and Spelling Correction

Author: Mani Kanta Sai Nuthi 2019701019
Date: 2023-06-23
Report no: IIIT/TH/2023/107
Advisor: Manish Shrivastava

Abstract

Natural Language Processing (NLP) is a rapidly growing field concerned with the interaction between computers and human languages. It uses computational techniques to understand and generate natural language text. Several NLP tasks, such as question answering, text summarization, and machine translation, are widely researched for many languages, including Indian languages. Indian languages are resource-scarce and have distinctive characteristics that pose particular challenges for NLP. Recent advances in NLP have helped in the development of models and techniques specific to Indian languages. For Telugu, a south Indian language, however, substantial research is still needed to improve the performance of several NLP systems. Progress on such systems will help the large Telugu-speaking community worldwide communicate and access information in Telugu through various NLP applications. This motivates us to develop essential NLP resources and systems for Telugu.

The thesis first provides an overview of the fundamental concepts and techniques used in NLP. It then approaches three specific NLP tasks: paraphrasing, question answering, and spelling correction. For each task, we address the problem of resource scarcity and present techniques for creating data for such low-resource languages.

For paraphrasing, we present two manually created and annotated corpora of 1544 and 10000+ samples, respectively, and discuss why manual intervention is necessary when creating such resources. For question answering, we introduce TeQuAD, a Telugu Question Answering Dataset of 82k parallel triples, and propose guidelines and methodologies that can be followed to create a question answering dataset for low-resource languages. For spelling correction, we present a spelling correction system for Telugu built with the help of a synthetically created dataset.

Full thesis: pdf
Centre: Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.