Leveraging Syntactic Information for Coherent and Comprehensible Summarization

Author: Litton J Kurisinkel
Date: 2021-09-15
Report no: IIIT/TH/2021/133
Advisor: Vasudeva Varma

Abstract

Text summarization is a natural language processing problem that the NLP community has investigated for half a century. In the era of information explosion, the community has intensified research into more sophisticated methods for automated text summarization. Past work has framed both extractive and abstractive techniques for multi-document summarization. Extractive techniques select a subset of sentences that approximates the summary of the input corpus of documents, while abstractive techniques construct a semantic representation and are expected to generate the summary in their own learned writing style. Extractive techniques create an intermediate representation of the target text that captures its key textual features. Possible approaches to this intermediate representation include topic signatures, word frequency counts, latent space approaches using matrix factorization, and Bayesian approaches. The intermediate representation is then used to assign scores to individual linguistic units within the text, and the subset of linguistic units that maximizes the total score is selected as the summary of the target text. The scoring function for the summary is generally composed of components that quantify topical coverage and topical diversity. Such systems typically report accuracy in terms of the ROUGE score. Comparatively little prior work addresses abstractive multi-document summarization. Most of these approaches use sub-syntactical structures extracted directly from the input documents to generate summary sentences. Sub-syntactical structures such as phrases are reorganized into summary sentences using a method that ensures relevant topical coverage, topical diversity and grammaticality. These approaches also incorporate means to ensure factual accuracy, so that the sentences generated by the abstractive summarization system are factually correct with respect to the original corpus.
Despite all the attempts to improve summarization in easily quantifiable dimensions such as topical coverage and diversity, a summary also needs to improve in qualitative dimensions such as comprehensibility and coherence to match a well-crafted human summary. Coherence represents the presence of inter-sentence structural relationships and topical continuity. Comprehensibility denotes how well a sentence selected during extractive summarization can be understood without its context in the source document.
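
As an illustration of the generic extractive pipeline described above (an intermediate representation, a per-unit score, and a selection that balances topical coverage and diversity), the following is a minimal sketch and not the method developed in the thesis. It assumes a word frequency representation, sentences as the linguistic units, and a simple greedy selection. The names `tokenize`, `gain`, `summarize`, `budget` and `redundancy_penalty` are hypothetical and introduced only for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphabetic tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def gain(sentence, freq, covered, redundancy_penalty):
    """Score one candidate sentence against the words already in the summary.

    Coverage: corpus frequency of words the summary does not contain yet.
    Diversity: words the summary already contains are penalized.
    """
    words = set(tokenize(sentence))
    new_mass = sum(freq[w] for w in words - covered)
    repeated_mass = sum(freq[w] for w in words & covered)
    return new_mass - redundancy_penalty * repeated_mass


def summarize(sentences, budget=2, redundancy_penalty=0.5):
    """Greedily select up to `budget` sentences, returned in source order."""
    # Intermediate representation (assumed here): word frequency counts.
    freq = Counter(w for s in sentences for w in tokenize(s))
    chosen, covered = [], set()
    candidates = list(range(len(sentences)))
    while candidates and len(chosen) < budget:
        best = max(
            candidates,
            key=lambda i: gain(sentences[i], freq, covered, redundancy_penalty),
        )
        chosen.append(best)
        covered |= set(tokenize(sentences[best]))
        candidates.remove(best)
    return [sentences[i] for i in sorted(chosen)]


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the cat sat on the mat",
        "dogs bark loudly at night",
    ]
    for line in summarize(docs, budget=2):
        print(line)
```

In this sketch the duplicate second sentence is skipped in favour of the third, because every word it contains is already covered. Nothing in the scoring considers whether a selected sentence reads well outside its original context or connects to its neighbours, which is the gap in comprehensibility and coherence that the abstract points to.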

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
