IIIT Hyderabad Publications
A Benchmark for Relevance-based Headline Classification and Generation

Author: Gopichand Kanumolu 2021701039
Date: 2024-06-22
Report no: IIIT/TH/2024/112
Advisor: Manish Shrivastava

Abstract

The task of news headline generation deals with generating a concise summary for a given news article. It is a crucial task for increasing the productivity of both readers and producers of news. Significant progress has been made in automatically generating headlines for widely spoken languages like English. A notable obstacle hindering headline generation in Indian languages is the lack of high-quality data. To address this gap, we present "Mukhyansh", a comprehensive multilingual dataset collected from the web for the task of Indian language headline generation. Mukhyansh contains over 3.39 million article-headline pairs across eight prominent Indian languages: Telugu, Tamil, Kannada, Malayalam, Hindi, Bengali, Marathi, and Gujarati.

News data collected from various websites comprises a mixture of relevant and irrelevant headlines, including sensational, clickbait, and misleading ones. As a consequence of these irrelevant headlines in scraped news articles, headline generation models often produce suboptimal results. We propose a novel approach centered on relevance-based headline classification to enhance the performance of headline generation models. Relevance-based headline classification deals with categorizing news headlines based on their relevance to the corresponding articles. While relevance-based headline classification is well established in English, its application in low-resource languages like Telugu remains largely unexplored due to a scarcity of annotated data. Our study aims to address this gap by introducing the "TeClass" dataset, the first-ever human-annotated relevance-based Telugu news headline classification dataset.
The proposed dataset contains 78,534 annotations across 26,178 article-headline pairs, making it the largest publicly available dataset of its kind. We experiment with various baseline models on this dataset and provide a comprehensive analysis of the model results. The annotated dataset, the annotation guidelines, and the models are made publicly available to encourage future research. Furthermore, we use this dataset to illustrate the impact of fine-tuning headline generation models on various headline categories, each exhibiting a different degree of relevance to its respective articles. Our empirical results demonstrate that the performance of headline generation models improves when the models are fine-tuned on datasets containing highly relevant headlines, despite the smaller quantity of such data.

Full thesis: pdf

Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.