IIIT Hyderabad Publications
Text Summarization for Resource-Poor Languages: Datasets and Models for Multiple Indian Languages

Author: Vakada Lakshmi Sireesha (20171137)
Date: 2023-05-20
Report no: IIIT/TH/2023/48
Advisor: Radhika Mamidi

Abstract

Document summarization aims to create a precise and coherent summary of a text document. A plethora of deep learning summarization models exist, developed mainly for English, and they typically require (i) a large training corpus and (ii) efficient pre-trained language models and tools. Summarization for low-resource languages such as the Indian languages, however, is hampered by rich morphological variation and by syntactic and semantic differences from English. Moreover, the lack of annotated corpora restricts supervision, which limits the generality and usability of models for these languages. The Graph Autoencoder (GAE) model has recently shown superior performance on several NLP tasks, even with limited resources. In this work, we propose GAE-ISUMM, an unsupervised Indic summarization model that produces extractive summaries. In particular, the proposed model uses a GAE to (i) learn document representations and (ii) jointly learn sentence representations and the summary of the document. For evaluation, we introduce TELSUM, a manually annotated summarization dataset comprising 501 document-summary pairs. Extensive experiments on an existing low-resource dataset (XL-Sum) and on TELSUM provide the following insights: (i) the proposed model achieves state-of-the-art results on XL-Sum and establishes benchmark results on TELSUM; (ii) surprisingly, including positional and cluster information in the model further improves the quality of the summaries. We open-source our dataset and code.
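The thesis links the actual code above; as a rough illustration of the approach, here is a minimal sketch of graph-autoencoder-based unsupervised extractive summarization in the style of Kipf and Welling's GAE. All names (`normalized_adjacency`, `GraphAutoencoder`, `summarize`), the similarity threshold, the one-layer GCN encoder with inner-product decoder, and the centroid-based sentence scoring are illustrative assumptions, not GAE-ISUMM's actual design, and the sketch omits the positional and cluster features mentioned above.

```python
# Illustrative sketch only: unsupervised extractive summarization with a
# graph autoencoder over a sentence-similarity graph (hypothetical design).
import torch
import torch.nn.functional as F

def normalized_adjacency(sim: torch.Tensor, threshold: float = 0.3) -> torch.Tensor:
    """Build A_hat = D^-1/2 (A + I) D^-1/2 from a sentence-similarity matrix."""
    a = (sim > threshold).float()
    a.fill_diagonal_(1.0)                        # add self-loops
    d_inv_sqrt = torch.diag(a.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ a @ d_inv_sqrt

class GraphAutoencoder(torch.nn.Module):
    """One-layer GCN encoder with an inner-product decoder (the GAE recipe)."""
    def __init__(self, in_dim: int, hid_dim: int = 128):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, hid_dim, bias=False)

    def forward(self, x, a_hat):
        z = torch.relu(a_hat @ self.proj(x))     # latent sentence representations
        return z, torch.sigmoid(z @ z.t())       # reconstructed adjacency

def summarize(sent_vecs: torch.Tensor, k: int = 3, epochs: int = 200):
    """Return indices of the k sentences closest to the learned document centroid."""
    sim = F.cosine_similarity(sent_vecs.unsqueeze(1), sent_vecs.unsqueeze(0), dim=-1)
    a_hat = normalized_adjacency(sim)
    target = (a_hat > 0).float()
    model = GraphAutoencoder(sent_vecs.size(1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(epochs):                      # train by reconstructing the graph
        z, a_rec = model(sent_vecs, a_hat)
        loss = F.binary_cross_entropy(a_rec, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    z, _ = model(sent_vecs, a_hat)
    doc = z.mean(dim=0, keepdim=True)            # document representation = centroid
    scores = F.cosine_similarity(z, doc)         # sentence salience scores
    k = min(k, sent_vecs.size(0))
    return scores.topk(k).indices.sort().values  # top-k sentences, in document order
```

The input here would be sentence embeddings from any multilingual encoder; positional or cluster features, as in the thesis, could be concatenated to `sent_vecs` before encoding.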
In addition, with the advancement of various deep-learning methodologies and transformer-based models, summarization has advanced to a new level. However, consistent, standard datasets must be produced to benefit fully from these deep learning algorithms, and, unlike for English, dedicated resources are rarely created for low-resource Indian languages, which hinders the progress of summarization. To this end, we create summarization resources for Indian languages by introducing ISummCorp (Indic Summarization Corpora) and IndicSumm (Indic Language Summarization Models). ISummCorp is a highly abstractive summarization dataset sourced from the Times of India (TOI) and manually annotated by experts across eight Indian languages. Human and intrinsic evaluations demonstrate the high quality, abstraction, and compactness of ISummCorp. IndicSumm is a set of diverse monolingual and multilingual models built on ISummCorp by fine-tuning the multilingual pre-trained mT5 model. Using ISummCorp, we fine-tune mT5 in both monolingual and multilingual settings and show that, given enough monolingual training data, a model performs better in the monolingual setting than under multilingual fine-tuning. Furthermore, we compare IndicSumm with other multilingual summarization models (XL-Sum and IndicBART) and achieve state-of-the-art results.
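The exact training configuration is given in the thesis; the following is a minimal, hypothetical sketch of the monolingual fine-tuning setup using the Hugging Face transformers API. The checkpoint (`google/mt5-small`), learning rate, sequence lengths, and the placeholder ISummCorp pair are assumptions, not the settings used for IndicSumm.

```python
# Illustrative sketch only: fine-tuning mT5 on one monolingual split of a
# summarization corpus (data loading and batching are mocked).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Placeholder for (document, summary) pairs from one ISummCorp language.
pairs = [("<document text in one Indian language>", "<reference summary>")]

model.train()
for doc, ref in pairs:
    inputs = tokenizer(doc, truncation=True, max_length=512, return_tensors="pt")
    labels = tokenizer(text_target=ref, truncation=True, max_length=64,
                       return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss   # standard seq2seq cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference: generate a summary for an unseen document.
model.eval()
ids = tokenizer("<new document>", truncation=True, return_tensors="pt").input_ids
summary_ids = model.generate(ids, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

A multilingual variant of this setup would simply pool the training pairs across all eight languages, which is the monolingual-versus-multilingual comparison described above.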
Full thesis: pdf
Centre: Language Technologies Research Centre