IIIT Hyderabad Publications
Headline Generation for Indian Languages

Author: LOKESH MADASU 2021701042
Date: 2024-06-29
Report no: IIIT/TH/2024/105
Advisor: Manish Shrivastava

Abstract

In the field of Natural Language Processing (NLP), the abundance of online content presents both opportunities and challenges. The internet hosts a wealth of information, encompassing diverse topics and languages, ranging from news articles to blog posts. However, navigating through this sheer volume of content can be overwhelming, leading to information overload and difficulty in identifying the most relevant content. As a result, there is a growing demand for efficient methods to distill complex textual information into concise and informative summaries. One key approach to addressing this challenge is through headline generation. Headline generation within the domain of NLP holds immense significance, particularly in today’s era of short attention spans and overwhelming information flow. The ability to quickly grasp the key points of a document can significantly enhance user experience and facilitate knowledge dissemination, especially across diverse linguistic communities. Despite considerable advancements in headline generation for widely spoken languages like English, challenges persist in generating headlines for low-resource languages, such as the rich and diverse Indian languages. One major obstacle hindering headline generation in Indian languages is the limited availability of high-quality data. To address this crucial gap, we introduce Mukhyansh, an extensive multilingual dataset tailored for Indian language headline generation. Mukhyansh comprises over 3.39 million article-headline pairs collected from the web, covering eight prominent Indian languages: Telugu, Tamil, Kannada, Malayalam, Hindi, Bengali, Marathi, and Gujarati. This thesis presents a comprehensive evaluation of several state-of-the-art baseline models on the Mukhyansh dataset.
Through empirical analysis, we demonstrate that models trained on Mukhyansh surpass existing models, achieving an average ROUGE-L score of 31.43 across all eight languages. However, the presence of irrelevant headlines in scraped news articles leads to sub-optimal performance of headline generation models. We propose that relevance-based headline classification can greatly aid the task of generating relevant headlines. Relevance-based headline classification involves categorizing news headlines based on their relevance to the corresponding news articles. While this task is well-established in English, it remains under-explored in low-resource languages like Telugu due to a lack of annotated data. To address this gap, we present “TeClass”, the first human-annotated relevance-based news headline classification dataset for Telugu, containing 78,534 annotations across 26,178 article-headline pairs. We use this dataset to demonstrate the impact of fine-tuning headline generation models on various categories of headlines (with varying degrees of relevance to the article), and show that relevant headline generation is best served when the models are fine-tuned on a dataset containing highly relevant headlines, even though the highly relevant subset is smaller. Our work highlights the effectiveness of Mukhyansh and TeClass in advancing headline generation and classification research for Indian languages and underscores their potential to facilitate further developments in this domain.

Full thesis: pdf
Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.