IIIT Hyderabad Publications
Importance of Facts for Natural Language Generation

Author: Tushar Abhishek
Date: 2022-11-26
Report no: IIIT/TH/2022/157
Advisors: Vasudeva Varma, Manish Gupta

Abstract: Natural Language Generation is the task of producing understandable human text from a variety of input sources. With the advent of pretrained language models (PLMs), the text generation capabilities of current systems have reached unprecedented heights. Pretrained language models have become the new normal and naturally serve as the backbone architecture for numerous tasks. During pretraining, these models learn many of the intricacies involved in language understanding and generation. Because they are pretrained over large corpora, they also discover world knowledge from text, some of which is absorbed into the model parameters. However, when fine-tuned on certain knowledge-intensive tasks (such as text coherence, data-to-text, summarization, and translation), they fail to utilize the intrinsic knowledge stored in their parameters effectively. In this thesis, we tackle this problem by incorporating external facts to improve results on two downstream tasks: multilingual fact-to-text generation and text coherence modeling.

We observe the close association of facts with improved text generation by focusing directly on the fact-to-text generation task. Fact-to-text generation is a variant of data-to-text generation in which the structured input consists of knowledge graph triples. A data-to-text generation system consumes structured input such as tables, databases, knowledge bases, or time-series data and produces human-readable text summaries. The first part of the thesis addresses multilingual fact-to-text generation, where facts are used to generate sentences in multiple languages. Fact-to-text generation requires a dataset in which knowledge graph triples are well aligned with semantically equivalent text.
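To make the input format concrete, the following is an illustrative sketch (not taken from the thesis) of the usual preprocessing step for fact-to-text generation: linearising (subject, relation, object) triples into a flat string that can be fed to a pretrained sequence-to-sequence model. The special markers `<S>`, `<R>`, `<O>` and the example triples are assumptions for illustration only.

```python
def linearize_triples(triples):
    """Turn (subject, relation, object) triples into one flat input string.

    Each triple is rendered with marker tokens so the generation model can
    distinguish subjects, relations, and objects in the linearised input.
    """
    parts = []
    for subj, rel, obj in triples:
        parts.append(f"<S> {subj} <R> {rel} <O> {obj}")
    return " ".join(parts)


# Hypothetical example: two facts about the same entity.
triples = [
    ("Marie Curie", "field", "physics"),
    ("Marie Curie", "award", "Nobel Prize in Physics"),
]
print(linearize_triples(triples))
# -> <S> Marie Curie <R> field <O> physics <S> Marie Curie <R> award <O> Nobel Prize in Physics
```

A target sentence aligned with these triples might be "Marie Curie was a physicist who won the Nobel Prize in Physics." — the alignment between such triple sets and semantically equivalent sentences is exactly what the dataset described below must supply.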
Manual creation of such a high-quality fact-to-text dataset requires human supervision and is hard to scale. Unsupervised alignment has recently emerged as an active area of research to overcome the lack of labelled data and the difficulty of domain adaptation. However, little work has been done for low-resource languages, which pose two significant challenges: (1) the unavailability of triples paired with native text of the same content distribution, and (2) limited Natural Language Processing resources for low-resource languages. Hence, we rigorously investigate the cross-lingual fact-to-text problem of aligning English structured data with sentences in multiple low-resource languages and develop a new dataset, XALIGN, consisting of 0.45M pairs across seven low-resource languages. We propose two different methods for cross-lingual fact-to-text alignment: (a) non-parametric approaches and (b) parametric approaches. Additionally, we establish strong baseline results by adapting popular natural language generation methods to the cross-lingual fact-to-text task.

An essential requirement for any text generation system is coherence. In the second part of the thesis, we address the detection of text coherence. A large body of previous work has leveraged entity-based methods, syntactic patterns, discourse relations, and traditional deep learning architectures for text coherence assessment. However, these approaches do not consider the factual information present in documents. Transitions of facts associated with entities across sentences could better capture the essence of textual coherence. We hypothesise that coherence assessment is a cognitively complex task that requires deeper fact-aware models and can benefit from other related tasks. To demonstrate this, we develop a novel deep learning model that fuses document-level information with factual information.
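The intuition behind fact transitions can be sketched with a toy example. This is an assumption-laden illustration, not the thesis model: a real system would feed fact transitions into a neural coherence scorer, whereas here a simple overlap of per-sentence fact sets serves as an intuition pump for why fact continuity separates coherent from incoherent text.

```python
def fact_continuity(sentence_facts):
    """Average Jaccard overlap of (entity, fact) sets between consecutive
    sentences.  Higher values mean facts persist across sentence boundaries,
    a rough proxy for coherence.

    sentence_facts: list of sets of (entity, fact) pairs, one set per sentence.
    """
    if len(sentence_facts) < 2:
        return 1.0
    overlaps = []
    for prev, curr in zip(sentence_facts, sentence_facts[1:]):
        union = prev | curr
        overlaps.append(len(prev & curr) / len(union) if union else 0.0)
    return sum(overlaps) / len(overlaps)


# Hypothetical fact sets: a coherent two-sentence passage shares a fact
# across sentences; a shuffled passage shares none.
coherent = [
    {("Curie", "physicist"), ("Curie", "Nobel laureate")},
    {("Curie", "Nobel laureate"), ("Curie", "born 1867")},
]
shuffled = [
    {("Curie", "physicist")},
    {("Einstein", "violinist")},
]
print(fact_continuity(coherent) > fact_continuity(shuffled))  # True
```

The coherent passage scores 1/3 (one shared fact out of three) while the shuffled one scores 0, illustrating the signal that a fact-aware model can exploit beyond entity mentions alone.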
We further enhance the model's efficacy by training it simultaneously with Natural Language Inference tasks in a multi-task learning setting, taking advantage of inductive transfer between the two tasks. Our experiments with popular benchmark datasets across multiple domains demonstrate that the proposed model consistently outperforms existing methods on synthetic coherence evaluation tasks and on two real-world tasks involving predicting varying degrees of coherence.

Full thesis: pdf

Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.