IIIT Hyderabad Publications
Cross-Lingual Approaches for Text Generation Tasks in Low-Resource Languages

Author: Shivprasad Sagare
Date: 2023-05-26
Report no: IIIT/TH/2023/52
Advisors: Vasudeva Varma, Manish Gupta

Abstract

Text generation has shown tremendous promise recently, mainly attributed to the use of the transformer architecture and models pretrained on vast amounts of data. Many business scenarios today deploy neural-network-based models for natural language generation (NLG) tasks. However, this progress is limited to English and other high-resource (HR) languages, with NLG systems in low-resource (LR) languages far behind in the accuracy and fluency of the generated text. This is due to several factors, such as a lack of training data, of robust models supporting native scripts, and of linguistic resources. In this work, we extensively study cross-lingual NLG as an approach to tackle these challenges: exploiting the widely available data in an HR language to generate the desired text in an LR language. We focus on two significant tasks, fact-to-text generation and summarization, with the larger goal of generating Wikipedia article text in LR languages. We propose novel ways to build datasets for these tasks, as well as approaches to generate text in LR languages.

First, we propose the novel task of cross-lingual fact-to-text generation (XF2T): given Wikidata facts in English, the system is expected to generate a sentence describing these facts in the desired language. To build a parallel dataset for training, we explore several methods of linking a Wikidata fact to a Wikipedia sentence, including unsupervised, distantly supervised, and zero-shot learning-based approaches. We use the best of these to create XAlign, a dataset of 0.55M instances across 12 Indian languages. We then implement a transformer encoder-decoder and the mT5 model as baselines on this dataset, and also explore the impact of task-specific pretraining and of bilingual and monolingual models. We experiment with techniques to improve performance, such as structure-aware encoding of facts and fusing role-specific embeddings. We show that these approaches generate fluent and highly accurate sentences.

Further, with the aim of generating longer text, we propose a novel approach to generating Wikipedia article section text via summarization. We leverage the citations available for each section of a Wikipedia page to build a parallel dataset for cross-lingual, multi-document, aspect-based summarization covering 8 domains and 15 languages. In the first, extractive stage, we filter relevant sentences from a set of reference articles using saliency- and graph-based methods. In the abstractive stage, we experiment with the recent state-of-the-art models mT5 and mBART. Despite high noise in the input reference articles, we show that the system generates fluent and meaningful outputs, although the models still have considerable scope for improvement in content coverage and text coherence.

Overall, we develop various cross-lingual NLG methods to advance the datasets and models available for LR languages. We hope this work will spur further research in these critical areas.

Full thesis: pdf
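To make the XF2T setup concrete, below is a minimal sketch of the pipeline the abstract describes: English Wikidata facts are linearized into a flat input string and passed to a multilingual encoder-decoder such as mT5, which generates a sentence in the target language. The prompt format, the fact-linearization scheme, and the checkpoint name are illustrative assumptions, not the exact configuration used in the thesis.

```python
# Sketch of XF2T inference: linearize English Wikidata facts and
# generate a target-language sentence with mT5. Assumptions: the
# "generate <lang>:" prefix and " | "-separated triples are a made-up
# format; a real system would fine-tune this checkpoint on XAlign first.
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

def linearize_facts(entity, facts, target_lang="Hindi"):
    # Flatten (relation, object) pairs into a single source string.
    triples = " | ".join(f"{rel}: {obj}" for rel, obj in facts)
    return f"generate {target_lang}: {entity} | {triples}"

# Off-the-shelf checkpoint; without XAlign fine-tuning its output
# will not yet be a faithful fact description.
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

source = linearize_facts(
    "Sachin Tendulkar",
    [("occupation", "cricketer"), ("country of citizenship", "India")],
)
inputs = tokenizer(source, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```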
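Similarly, a minimal sketch of the extractive stage of the summarization pipeline: sentences drawn from the cited reference articles are scored against the target section's aspect, and only the top-scoring ones are passed on to the abstractive model. TF-IDF cosine similarity stands in here for the saliency- and graph-based scoring the thesis experiments with; the function and variable names are illustrative.

```python
# Sketch of extractive filtering: rank reference sentences by
# similarity to the section aspect and keep the top-k. TF-IDF is a
# stand-in for the thesis's saliency / graph-based methods.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def salient_sentences(sentences, aspect_query, top_k=10):
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([aspect_query] + sentences)
    # Row 0 is the aspect query; the rest are candidate sentences.
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    ranked = sorted(zip(scores, sentences), reverse=True)
    return [sentence for _, sentence in ranked[:top_k]]

refs = [
    "Tendulkar made his Test debut against Pakistan in 1989.",
    "The stadium can seat over 30,000 spectators.",
    "He scored his hundredth international century in 2012.",
]
print(salient_sentences(refs, "early career and debut", top_k=2))
```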