IIIT Hyderabad Publications
Enriching Structured and Unstructured Knowledge for Low-Resource Languages by Cross Lingual Fact Extraction and Text Generation

Author: Bhavyajeet Singh
Date: 2023-06-26
Report no: IIIT/TH/2023/95
Advisors: Vasudeva Varma, Manish Gupta

Abstract:

Natural language generation has gained tremendous popularity in recent times, primarily due to the advent of large pretrained language models trained on vast amounts of data. However, most of this progress has been limited to a few high-resource languages like English. Almost all low-resource (LR) languages still suffer from a lack of sufficient training data and hence a lack of usable generative models. Furthermore, multiple business scenarios require the automated generation of descriptive, human-readable long text from structured input data, where the source is typically a high-resource language and the target is a low- or medium-resource language. In this work, we present systems and approaches which can be utilised to enrich the structured and unstructured content available for low-resource languages in the encyclopedic domain. To do so, we introduce cross-lingual techniques which efficiently utilise the abundant structured data available in high-resource languages. We also introduce systems to further enrich this structured data using the information present as natural language text in low-resource languages.

First, we propose the novel problem of cross-lingual fact-to-text alignment in order to construct the XAlign dataset for cross-lingual fact-to-text generation and fact extraction. We explore several methods to automatically align English facts from Wikidata to sentences from native-language Wikipedia. We experiment with approaches accounting for syntactic and semantic matches between a fact and a sentence, propose a two-stage pipeline for automated alignment, and evaluate it on a manually annotated, high-quality test set. We also experiment with distant supervision and transfer learning based techniques to achieve high-quality alignment. We use the best approach to create the XAlign dataset, which consists of more than half a million aligned (sentence, facts) pairs across 12 Indian languages.
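To make the two-stage alignment above concrete, here is a minimal sketch in Python using a multilingual sentence encoder. This is an illustration only, not the exact XAlign pipeline: the encoder checkpoint, the crude lexical filter in the first stage, and the similarity threshold are all assumptions.

    # Minimal, illustrative sketch of two-stage fact-to-sentence alignment.
    # NOT the exact XAlign pipeline: the checkpoint, the crude Stage-1
    # lexical filter, and the similarity threshold are all assumptions.
    from sentence_transformers import SentenceTransformer, util

    # Multilingual encoder that maps English facts and Indic-language
    # sentences into a shared embedding space (checkpoint chosen for
    # illustration only).
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    def align(facts, sentences, threshold=0.5):
        """Return (sentence, fact) pairs scored above the threshold.

        facts     -- English (subject, relation, object) triples from Wikidata
        sentences -- native-language Wikipedia sentences
        """
        # Stage 1 (syntactic): a cheap, recall-oriented filter. Here a pair
        # survives only if some token of the object occurs in the sentence.
        candidates = [
            (i, j)
            for i, sent in enumerate(sentences)
            for j, (subj, rel, obj) in enumerate(facts)
            if any(tok.lower() in sent.lower() for tok in obj.split())
        ]
        # Stage 2 (semantic): score surviving pairs by cosine similarity
        # between the verbalised fact and the sentence.
        fact_texts = [f"{s} {r} {o}" for s, r, o in facts]
        s_emb = model.encode(sentences, convert_to_tensor=True)
        f_emb = model.encode(fact_texts, convert_to_tensor=True)
        scores = util.cos_sim(s_emb, f_emb)  # (num_sentences, num_facts)
        return [
            (sentences[i], facts[j])
            for i, j in candidates
            if float(scores[i, j]) >= threshold
        ]

In practice the first stage would rely on transliteration or entity linking rather than raw string overlap, since English object strings rarely occur verbatim in Indic-script text.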
Following the construction of the dataset, we propose the problem of Cross-Lingual Fact Extraction (CLFE). Recent approaches concentrate on automatically enriching large knowledge graphs like Wikidata and DBpedia from text; however, much of the information present as natural language text in low-resource languages is missed. Furthermore, considering the potential use case of utilising structured data to generate content in various LR languages, the CLFE task aims at extracting factual information in the form of English triples from LR Indian-language text. Despite its massive potential, progress on this task lags behind monolingual information extraction. We propose strong baselines and an end-to-end generative approach for CLFE, which achieves an overall F1 score of 77.46.

We then introduce and explore the problem of cross-lingual fact-to-text generation (XF2T). We extensively explore multiple approaches for the task and analyse the different components of the pipeline, starting with the choice of pretrained Transformer model and the impact of different continued pretraining strategies. We also show that building cross-lingual systems yields better performance than translation-based approaches or multiple bilingual models, thus validating the necessity of the proposed problem. We introduce novel techniques like fact-aware embeddings to further improve generation quality, and demonstrate that these methods produce coherent and precise sentences. Our experiments with the XF2T task show that these generative models suffer from hallucination and, due to the training setup, are limited to generating a single sentence at a time.

To mitigate these limitations, we extend the XF2T task to the problem of Cross-Lingual Fact to Long Text Generation (XFLT). The task involves generating descriptive, human-readable long text in a target language from structured input data (such as fact triples) in a source language. XFLT is challenging because of (a) the hallucinatory nature of state-of-the-art NLG models, (b) the lack of good-quality training data, and (c) the lack of a suitable cross-lingual NLG metric. Unfortunately, previous work focuses on related problem settings such as monolingual graph-to-text generation and makes no specific effort to handle hallucinations. Hence, we propose a novel solution to the XFLT task which addresses these challenges by training multilingual Transformer-based encoder-decoder models with coverage prompts and grounded decoding (see the sketch after this abstract). It further improves XFLT quality by defining task-specific reward functions and training on them using reinforcement learning. On a dataset with over 64,000 paragraphs across 12 different languages, we compare this novel solution with several strong baselines using a new metric, cross-lingual PARENT.

Overall, we work on multiple related tasks aimed at automating the generation of encyclopedic articles and at consolidating the factual information available as natural language text in multiple LR languages to enrich structured knowledge bases.
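As a concrete illustration of the coverage-prompted generation referenced above, the following Python sketch feeds linearised English facts, together with a coverage-style instruction, to a multilingual encoder-decoder. The mT5 checkpoint, the linearisation tags, and the prompt wording are hypothetical choices for illustration; a model fine-tuned on XAlign-style (facts, text) pairs would be needed to produce sensible output, and the grounded decoding and reinforcement-learning components are omitted.

    # Illustrative sketch of coverage-prompted cross-lingual fact-to-text
    # generation. The checkpoint, linearisation tags, and prompt wording
    # are assumptions; an off-the-shelf mT5 is NOT fine-tuned for this task.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

    def linearize(facts, target_lang):
        # English (subject, relation, object) triples are flattened into a
        # tagged string; the "covering every fact" instruction acts as a
        # simple coverage prompt nudging the decoder to mention all inputs.
        triples = " | ".join(f"<S> {s} <R> {r} <O> {o}" for s, r, o in facts)
        return f"Generate {target_lang} text covering every fact: {triples}"

    facts = [
        ("Satyajit Ray", "occupation", "film director"),
        ("Satyajit Ray", "award", "Bharat Ratna"),
    ]
    inputs = tokenizer(linearize(facts, "Hindi"), return_tensors="pt")
    output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=128)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))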
Full thesis: pdf
Centre: Language Technologies Research Centre