IIIT Hyderabad Publications
Systems and Resources for Telugu: Question Answering and Summarization

Author: Priyanka Ravva 20172157
Date: 2023-02-24
Report no: IIIT/TH/2023/9
Advisor: Manish Shrivastava

Abstract

Natural language processing (NLP) is a bridge between computers and humans interacting in natural language. NLP has a wide variety of applications, such as machine translation, text summarization, question answering, and sentiment analysis. These applications have had a large impact on society through use cases such as chatbots, voice assistants (Alexa, Siri), and recommendation systems (YouTube, Hotstar). Most of these NLP applications are limited to a few high-resource languages like English. Notably, in the Indian scenario, where each state has its own language, only 10% of people communicate in English. This motivates us to develop NLP systems and resources for Indian languages, which can have a large impact on multilingual Indian society. In this thesis, we create question-answering (QA) and text-summarization resources and systems for the Telugu language.

The main aim of a QA system is to provide an accurate and concise answer to a question asked by a human in natural language. In this thesis, we create a question classification dataset consisting of 1037 samples and explain the ambiguities and challenges involved in creating it. We build an end-to-end pipeline for a QA system, named AVADHAN. We compare three different classifiers for the Telugu Question Classification (QC) module. QC helps reduce the search space when extracting the answer to a given query.

Text summarization is a way of obtaining a short and precise summary from a given document of arbitrary length. In this thesis, we propose a pipeline that crowd-sources summarization data and then aggressively filters the content using automatic and partial expert evaluation. With this pipeline we create TeSum, a high-quality, human-generated abstractive summarization corpus for Telugu. The corpus consists of 20329 high-quality article-summary pairs and is, to our knowledge, the first large, high-quality abstractive summarization corpus in Telugu. We also explore different sequence-to-sequence neural networks on the TeSum corpus and report ROUGE scores.

Full thesis: pdf

Centre for Language Technologies Research Centre