IIIT Hyderabad Publications
Learning from Noisy Data for Cross-Lingual Text Generation in Low-Resource Languages

Author: Kancharla Aditya Hari (2020121010)
Date: 2024-06-22
Report no: IIIT/TH/2024/97
Advisor: Vasudeva Varma

Abstract

With Large Language Models (LLMs), and language models in general, becoming a more significant part of our daily content consumption, it is paramount that languages with fewer resources are not excluded. Because most language models are trained on online data, their performance on low-resource languages is usually significantly worse than on languages such as English. This performance gap leaves speakers of low-resource languages with a diminished experience when consuming information and participating in online discourse. In recent years, methods have emerged that address this resource gap by generating large datasets to enable the training of models for low-resource languages across various tasks. One such task is fact-to-text generation, where cross-lingual generation has gained prominence for its ability to leverage high-resource languages to augment generation in low-resource languages. However, these works rarely address the noisy nature of synthetically created datasets, which can cause models to hallucinate and reduces their usefulness for factually grounded tasks.

This work investigates methods and ideas for using noisy datasets carefully. Methods that account for the noisy nature of the data can improve the quality of generated text without requiring significant modelling or architectural changes. We leverage techniques such as curriculum learning and, in the process, describe various metrics for quantifying data quality. Our work focuses on cross-lingual fact-to-text generation, and we accordingly extend it to generating factually grounded text.

We begin our study with the XAlign dataset and investigate how curriculum learning can improve model performance on this task. Using a sharded curriculum learning framework, we experiment with different curriculum schedules and data-ordering metrics and delineate how different metrics perform under different schedules. We show that curriculum learning with commonly used metrics outperforms plain, non-curriculum training. We also introduce a novel metric for ordering data, the coverage score, which captures the semantic alignment between the input and the reference text. Training with data ordered by coverage score under a gradually refining schedule yields the best-performing model.

Next, we apply these findings to a more challenging setting: long-text generation. To this end, we create a new synthetic dataset from the XAlign dataset and show that our earlier findings do not carry over to this setting. We identify the cause of this discrepancy and show that a simple curriculum learning framework is not enough here. Instead, we denoise the training set using different trusted data sources and show that ordering data by the resulting noise score, combined with a probabilistic sampling-based curriculum, improves performance.

Finally, we conclude our studies by explicitly focusing on reducing hallucinations in long-text generation. We introduce a modular pipeline-based approach with multiple steps that mitigate hallucination during various stages of training, and we show that it yields sizeable improvements over end-to-end training.
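As a concrete illustration of the data-ordering ideas above, the following is a minimal sketch of a coverage-style score and a sharded curriculum split. The token-overlap scoring, the function names, and the shard construction are assumptions made for illustration only; the thesis's coverage score measures semantic alignment, whereas plain lexical overlap is used here to keep the sketch dependency-free.

```python
# Illustrative sketch only -- not the thesis's implementation.
# The thesis's coverage score is semantic; token overlap stands in here.
from typing import List, Tuple

Example = Tuple[List[str], str]  # (input fact tokens, reference text)

def coverage_score(fact_tokens: List[str], reference: str) -> float:
    """Fraction of input fact tokens that also appear in the reference."""
    ref_tokens = set(reference.lower().split())
    if not fact_tokens:
        return 0.0
    hits = sum(1 for tok in fact_tokens if tok.lower() in ref_tokens)
    return hits / len(fact_tokens)

def make_shards(examples: List[Example], n_shards: int) -> List[List[Example]]:
    """Order examples from cleanest to noisiest and split into shards.

    A sharded curriculum then trains shard by shard, admitting noisier
    data as training progresses.
    """
    ranked = sorted(examples, key=lambda ex: coverage_score(*ex), reverse=True)
    size = max(1, -(-len(ranked) // n_shards))  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```

Under one reading of a "gradually refining" schedule, training would begin on the full mixture of shards and progressively restrict itself to the cleanest ones; a split like the one above supports schedules in either direction.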
We also introduce a new evaluation metric for texts with divergent references, where accounting for the source is also essential.

In summary, this work covers various facets of learning with noisy data for the problem of cross-lingual fact-to-text generation. Synthetically created datasets can bridge the gap between languages, but training models on such datasets is challenging. Through extensive experimentation, we demonstrate several ways to tackle this problem.

Full thesis: pdf

Centre for Language Technologies Research Centre
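As a closing illustration, the probabilistic sampling-based curriculum described in the abstract might look like the minimal sketch below, assuming each training example carries a precomputed noise score in [0, 1] (lower meaning cleaner). The annealing schedule and all names are illustrative assumptions, not the thesis's exact formulation.

```python
# Illustrative sketch only -- not the thesis's exact formulation.
# Assumes each example has a noise score in [0, 1]; lower is cleaner.
import random
from typing import List, Sequence

def sampling_weights(noise_scores: Sequence[float], progress: float) -> List[float]:
    """Favour clean examples early (progress ~ 0) and approach uniform
    sampling late in training (progress ~ 1)."""
    sharpness = 5.0 * (1.0 - progress)  # assumed annealing schedule
    return [max(1e-6, (1.0 - s) ** sharpness) for s in noise_scores]

def sample_batch(noise_scores: Sequence[float], progress: float,
                 batch_size: int) -> List[int]:
    """Draw example indices for one batch, weighted by cleanliness."""
    weights = sampling_weights(noise_scores, progress)
    return random.choices(range(len(noise_scores)), weights=weights, k=batch_size)
```

Unlike a hard shard boundary, probabilistic sampling never fully excludes noisy examples but only down-weights them early in training, which is one plausible reason such a scheme can help where a simple sharded curriculum does not.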
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.