IIIT Hyderabad Publications
References as Building Blocks: Investigating their Significance in Encyclopedic Text Generation

Author: Dhaval Taunk (2021701028)
Date: 2023-09-23
Report no: IIIT/TH/2023/142
Advisor: Vasudeva Varma

Abstract

Automated text generation for low-resource (LR) languages is a critical area of research because of the lack of contributors to encyclopedic texts, notably on Wikipedia. Most work on Wikipedia text generation so far has concentrated on creating English-only Wikipedia articles by summarizing English reference articles. Monolingual text generation cannot address this issue for low-resource languages due to the lack of reference material. To begin addressing these problems, we propose a benchmark dataset called XWikiRef that consists of ~69K Wikipedia articles spanning five domains and eight languages. Using this dataset, we train a two-stage system that takes a set of citations and a section title as input and outputs a section-specific LR summary. One crucial aspect of content organization is the creation of article outlines, which summarize the primary topics and subtopics covered in an article in a structured manner. We introduce XOutlineGen, a pipeline that generates cross-lingual outlines for encyclopedic texts from reference articles. XOutlineGen also builds on the XWikiRef dataset, which consists of encyclopedic texts generated from reference articles and section titles; our pipeline uses it to train a two-step generation model that takes the article title and a set of references as input and produces the article outline. Commonsense question answering (QA) methods combine the power of pre-trained language models (LMs) with the reasoning provided by knowledge graphs (KGs). A typical approach collects nodes relevant to the QA pair from a KG to form a Working Graph (WG), followed by reasoning using Graph Neural Networks (GNNs).
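The two-stage section-generation setup described above (citations + section title in, section text out) can be illustrated with a minimal, self-contained sketch. This is not the thesis's neural system: a simple TF-IDF relevance scorer stands in for the neural extractive stage, and the abstractive stage is stubbed out by concatenation; the function names (`extract_salient`, `generate_section`) are illustrative, not from the work.

```python
from collections import Counter
import math

def tfidf_vectors(texts):
    # Build simple TF-IDF vectors over whitespace tokens.
    docs = [Counter(t.lower().split()) for t in texts]
    df = Counter()
    for d in docs:
        df.update(d.keys())
    n = len(docs)
    return [
        {w: tf * math.log((1 + n) / (1 + df[w])) for w, tf in d.items()}
        for d in docs
    ]

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def extract_salient(section_title, reference_sentences, k=2):
    # Stage 1 (extractive): rank reference sentences by similarity
    # to the section title and keep the top-k.
    vecs = tfidf_vectors([section_title] + reference_sentences)
    title_vec, sent_vecs = vecs[0], vecs[1:]
    ranked = sorted(zip(reference_sentences, sent_vecs),
                    key=lambda p: cosine(title_vec, p[1]), reverse=True)
    return [s for s, _ in ranked[:k]]

def generate_section(section_title, references, k=2):
    # Stage 2 (abstractive) is stubbed here: a real system would feed
    # the selected sentences to a multilingual seq2seq model that
    # writes the section text in the target language.
    salient = extract_salient(section_title, references, k)
    return " ".join(salient)
```

In the actual pipeline both stages are neural and cross-lingual; the sketch only conveys the interface, i.e. that salient content is selected from references before any text is written.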
This approach faces two major challenges: (i) it is difficult to capture all the information from the QA pair in the WG, and (ii) the WG contains some irrelevant nodes from the KG. To address these, we propose GrapeQA with two simple improvements to the WG: (i) Prominent Entities for Graph Augmentation identifies relevant text chunks from the QA pair and augments the WG with the corresponding latent representations from the LM, and (ii) Context-Aware Node Pruning removes nodes that are less relevant to the QA pair. We evaluate on OpenBookQA, CommonsenseQA and MedQA-USMLE and find that GrapeQA shows consistent improvements over its LM + KG predecessor (QAGNN in particular), with large improvements on OpenBookQA. We reuse the idea of relevance scoring from this work in our next contribution, XWikiGen, for neural extractive summarization. With this study, we propose XWikiGen, the task of cross-lingual multi-document summarization: producing Wikipedia-style text from multiple reference articles written in different languages. The proposed approach builds on the idea of first using neural unsupervised extractive summarization to coarsely select salient information, and then using a neural abstractive model to produce the section-specific text. Extensive experiments reveal that multi-domain training generally outperforms multi-lingual training, and that multi-lingual-multi-domain training performs best, better than both of the previous settings. Overall, we propose a new dataset, XWikiRef, for encyclopedic text generation; a two-stage pipeline, XOutlineGen, that generates article outlines from references; and a cross-lingual multi-document summarization based two-stage pipeline, XWikiGen, that generates Wikipedia-style text. Along with these, we explore the idea of relevance scoring, first in question answering with reasoning (GrapeQA) and then in the context of unsupervised extractive summarization.
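The Context-Aware Node Pruning idea above can be sketched in a few lines: score every Working Graph node against an embedding of the QA context and discard the least relevant fraction. This is a toy sketch with hand-made vectors, not GrapeQA's implementation; the function name `prune_working_graph` and the `keep_ratio` parameter are illustrative assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def prune_working_graph(context_vec, node_vecs, keep_ratio=0.5):
    # Score every Working Graph node against the QA context embedding
    # and keep only the most relevant fraction of nodes.
    scored = sorted(node_vecs.items(),
                    key=lambda kv: cosine(context_vec, kv[1]),
                    reverse=True)
    k = max(1, int(len(scored) * keep_ratio))
    return {name: vec for name, vec in scored[:k]}
```

The same relevance-scoring view (rank candidates against a query representation, keep the top scorers) is what carries over to the extractive stage of XWikiGen, where the candidates are reference sentences rather than KG nodes.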
Full thesis: pdf

Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.