IIIT Hyderabad Publications
Towards Building Question Answering Resources for TELUGU

Author: VEMULA RAKESH KUMAR 2019701027
Date: 2024-06-22
Report no: IIIT/TH/2024/110
Advisor: Manish Shrivastava

Abstract

Natural Language Processing (NLP) is a cutting-edge field of artificial intelligence that empowers computers to understand and work with human language. It plays a crucial role in various practical applications that impact our daily lives. Its applications span a wide spectrum, including machine translation, text summarization, question-answering, and sentiment analysis, among others. These applications have left an indelible mark on society, giving rise to chatbots, voice assistants like Alexa and Siri, and recommendation systems on platforms such as YouTube, Netflix, and Hotstar. Within the realm of NLP, Question Answering (QA) represents a pivotal field. QA involves the intricate interplay of query formulation, document retrieval, and, at times, document summarization. Recent strides in this domain have given rise to comprehensive end-to-end systems capable of extracting precise answers from extensive text collections. These systems can be trained on expansive datasets, tailored either to a specific domain (referred to as closed domain) or spanning a wide array of subjects (referred to as open domain). Nonetheless, the efficacy of QA systems heavily hinges upon the curation of meticulous datasets, a process that demands substantial manual labor and resources. This challenge becomes particularly pronounced when considering Indian languages, where access to dependable and substantial datasets remains limited. In such contexts, the exclusive reliance on data-driven neural network approaches proves inadequate. Therefore, the imperative arises to strengthen available data resources and introduce innovative, data-independent techniques.

Recent state-of-the-art models and new datasets have advanced many areas of NLP. Machine Reading Comprehension (MRC) tasks in particular have improved with the help of datasets like SQuAD (Stanford Question Answering Dataset). However, large, high-quality datasets that would allow similar progress in MRC are still not available for low-resource languages like Telugu. This thesis explores Telugu, a resource-scarce language and one of the widely spoken Dravidian languages in India, with around 96 million native speakers. We present TeQuAD, a Telugu Question Answering Dataset of 82k parallel triples created by translating triples from SQuAD. We also introduce a few methods for creating similar Question Answering datasets for low-resource languages. Finally, we present the performance of our models, which outperform baseline models in both the monolingual and Cross-Lingual Machine Reading Comprehension (CLMRC) setups.

Full thesis: pdf
Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.