IIIT Hyderabad Publications
Enhancing Text Summarization for Indian Languages: Mono, Multi and Cross-lingual Approaches

Author: ASHOK URLANA (2020701023)
Date: 2023-07-11
Report no: IIIT/TH/2023/111
Advisor: Manish Shrivastava

Abstract

The internet serves as a vast repository of information covering a diverse array of topics, from blogs and articles to entire websites. However, not all of this information is valuable or relevant. Navigating the plethora of content to gain a comprehensive understanding of a topic can be daunting and time-consuming, and it is all too common to invest time in reading content that ultimately proves unimportant or irrelevant. Given the inherent limits of human cognitive capacity to process large quantities of information, concise and relevant summaries are highly sought after as a means of comprehending complex subjects efficiently and effectively.

Summarization is a computational task that condenses textual information into a concise version containing only the most essential and relevant information. There are two main approaches: extractive summarization selects sentences from the source based on their importance, while abstractive summarization may introduce new words or phrases in the summary. Document summarization has been studied by the NLP community for over three decades. However, progress in Indian-language summarization has been limited by the lack of high-quality datasets and benchmark models, which motivated us to develop resources and benchmarks for Indian languages. In this thesis, we develop text summarization resources for Indian languages in three settings: mono-lingual, cross-lingual, and multi-lingual. The initial focus of the thesis is mono-lingual summarization, specifically the creation of a high-quality dataset for the widely spoken South Indian language Telugu.
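The extractive approach described above can be illustrated with a minimal sketch: score each sentence by the corpus-level frequency of its words and return the top-k sentences verbatim. This toy scorer is purely illustrative and is not the method used in the thesis; all names and thresholds here are assumptions.

```python
# Minimal sketch of extractive summarization: rank sentences by a crude
# word-frequency importance signal and keep the k best, in original order.
from collections import Counter

def extractive_summary(sentences, k=2):
    # Corpus-level word frequencies over all sentences.
    freq = Counter(w.lower() for s in sentences for w in s.split())

    # Score a sentence as the mean frequency of its words.
    def score(s):
        words = s.split()
        return sum(freq[w.lower()] for w in words) / max(len(words), 1)

    # Select the k highest-scoring sentences, preserving document order.
    top = sorted(sentences, key=score, reverse=True)[:k]
    return [s for s in sentences if s in top]
```

An abstractive system, by contrast, would generate the summary token by token and is free to use words that never appear in the source.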
We propose a pipeline that crowd-sources summarization data and then aggressively filters the content through automatic checks and partial expert evaluation. Using this pipeline, we create TeSum, a high-quality Telugu abstractive summarization dataset of 20,329 document-summary pairs, created by 347 annotators and evaluated by 3 raters. We carefully designed annotation guidelines around the parameters of relevance, readability, and creativity, and we compare our dataset with existing Telugu summarization datasets.

By training a summarization system on multiple languages, the system can learn to represent concepts in a shared space, regardless of the language in which they are expressed. This shared representation is useful for transfer learning, as it enables the model to apply knowledge gained from one language to another. To this end, we perform multi-lingual and cross-lingual summarization for Indian languages. For multi-lingual summarization, we build baselines on the Indian Language Summarization (ILSUM) dataset, which covers Hindi, Gujarati, and Indian English, and we apply the proposed filters to ILSUM to assess its quality. We conducted experiments with different pre-trained sequence-to-sequence models to identify the best-performing model for each language, analyzed in depth the impact of k-fold cross-validation when dealing with limited data, and additionally ran experiments on combinations of the original and filtered versions of the data to assess the effectiveness of the pre-trained models.

We also present PMIndiaSum, a new cross-lingual and highly parallel summarization dataset for the languages of India. The dataset covers 4 language families, 14 languages, and 196 language pairs. We detail the approaches taken to derive this dataset, including data acquisition, cleaning, quality assurance, and inspection.
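The abstract does not spell out the automatic filters used to clean the crowd-sourced pairs. As a purely hypothetical sketch, filters of this kind commonly threshold the compression ratio and the word overlap between summary and document; the function name, criteria, and thresholds below are assumptions, not the thesis's actual filter set.

```python
# Hypothetical automatic quality filters for document-summary pairs.
def passes_filters(document, summary,
                   min_compression=0.5, max_overlap=0.8):
    doc_words = document.split()
    sum_words = summary.split()
    if not doc_words or not sum_words:
        return False
    # Compression: the summary should be substantially shorter than the source.
    compression = 1 - len(sum_words) / len(doc_words)
    if compression < min_compression:
        return False
    # Abstractivity: reject near-verbatim summaries, i.e. those whose words
    # are almost all copied from the document.
    overlap = sum(w in set(doc_words) for w in sum_words) / len(sum_words)
    return overlap <= max_overlap
```

Pairs failing either check would be dropped before any expert evaluation, keeping the manual review budget focused on plausible candidates.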
In addition, we publish benchmarks for several methodologies, including fine-tuning pre-trained language models and summarization-and-translation. Experimental results suggest that the provision of multilingual data enhances cross-lingual summarization between Indian languages. Finally, this thesis delves into multi-perspective scientific document summarization: our objective is a model that generates a generic summary encompassing the various aspects covered by multiple reference summaries of a scientific document. We describe the pre-trained models used for this task, as well as the challenges encountered in the process.

Full thesis: pdf

Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.