IIIT Hyderabad Publications
An Information Loss based Framework for Document Summarization

Author: Chandan Kumar
Date: 2009-06-30
Report no: IIIT/TH/2009/3
Advisor: Vasudeva Varma

Abstract

With the vast amount of data available electronically, there is an ever increasing need to sort out and extract only the chief and pertinent sections of the data sources to present to the user. With time and space being critical constraints in this electronic era, it is indispensable to provide a mechanism to locate and browse the required information quickly and avoid information overload. Automatic multi-document text summarization fulfills this need by presenting the user with the desired information in the form of a quick summary. Given a collection of documents related to the same topic, the goal of a multi-document summarization system is to generate a short and concise summary that can be read in lieu of the original document collection. In multi-document summarization, sentence extraction is a critical phase in the formation of useful summaries: the task is to pick a subset of sentences from the document cluster that, together, give an overall sense of the collection's content. Developing a principled sentence extraction mechanism that also performs well empirically is a significant challenge in itself.

This thesis presents a new sentence extractive summarization framework based on information loss. We treat summarization as a decision-making problem: given a set of documents, the selection of a few sentences to represent the whole document set, whether by a human or a system, is a critical decision. Drawing on decision theory, we derive a general extraction mechanism that picks sentences in ascending order of the expected risk of information loss. We propose an intrinsic loss function to compute this information loss and to decide whether to include a sentence in the summary. Sentences and documents are treated as distinct text units, each represented by a probabilistic language model that estimates its distribution. In this inferential setting we use an entropy-based intrinsic loss function, relative entropy, to measure the discrepancy between the sentence and document models. The relative entropy loss function measures how badly a sentence distribution models the document distribution, capturing the amount of information lost in picking that sentence to represent the whole document cluster. A sentence is thus selected for the summary not by a set of features but solely by this measure of loss. A large document set is divided into smaller text units (sentences), each of which approximates the document set variationally, yielding an overall variational approximation: the candidate summary sentences act as surrogates for the document set in a larger inference process. This decomposition strategy leads directly to a new sentence extraction algorithm. Combined with a simple redundancy identification and text reformulation mechanism, this yields a lightweight summarizer that generates more informative summaries. The proposed algorithm produces extracts on the fly, without the extensive computation or training that various state-of-the-art algorithms require. Furthermore, we consider different information-theoretic divergence measures and loss functions for estimating the loss between the sentence and document distributions, and analyze their performance.
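To make the extraction criterion concrete, the following is a minimal sketch of the core idea, not code from the thesis: sentences and the document cluster are modeled as add-alpha smoothed unigram distributions, and sentences are ranked by ascending relative entropy D(document || sentence). All function and variable names here are illustrative assumptions.

```python
import math
from collections import Counter

def unigram_model(tokens, vocab, alpha=0.01):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    """Relative entropy D(p || q): how badly q models p."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def rank_sentences(sentences):
    """Rank sentences by ascending information loss against the collection.

    The loss of a sentence is the relative entropy between the document
    model and the sentence model, i.e. the information lost when the
    sentence distribution stands in for the whole document set.
    """
    doc_tokens = [t for s in sentences for t in s.lower().split()]
    vocab = set(doc_tokens)
    doc_model = unigram_model(doc_tokens, vocab)
    losses = []
    for s in sentences:
        sent_model = unigram_model(s.lower().split(), vocab)
        losses.append((kl_divergence(doc_model, sent_model), s))
    return [s for _, s in sorted(losses, key=lambda x: x[0])]
```

A summary of length k would take the first k ranked sentences, after filtering near-duplicates with a redundancy check.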
In order to evaluate the performance of our approach, we use the DUC (Document Understanding Conference) and MSE (Multilingual Summarization Evaluation) datasets, which have been widely used in recent document summarization evaluations. We apply ROUGE (Recall-Oriented Understudy for Gisting Evaluation) as the automatic summary evaluation metric, the standard way of evaluating summaries. It essentially calculates n-gram overlaps between automatically generated summaries and previously written human summaries; a high level of overlap indicates a high level of shared concepts between the two summaries. Our overall results are the best reported on the DUC-2004 summarization task for all three metrics, ROUGE-1, ROUGE-2, and ROUGE-SU4, and on MSE-2005 they are among the best, with no statistically significant difference from the best system. Results on the DUC-2007 dataset support our findings on DUC-2004 and MSE-2005. Our system is also substantially simpler than the previous best systems.

Furthermore, we go beyond the traditional notion of generic relevance and incorporate a user factor as a sentence extraction criterion. Here we treat the summarization process as a function not only of the input text but also of its reader: we believe that a good summary should change in accordance with the preferences of its reader. For this purpose we model the user within the proposed information loss based framework to extract user-specific personalized summaries, creating web-based profiles from the personal data available online. To evaluate the personalized summaries, a controlled, user-centered qualitative evaluation was carried out on news articles from the science and technology domain. The results indicate higher user satisfaction with personalized summaries than with generic summaries.
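As an illustration of the ROUGE computation described above, here is a simplified sketch of single-reference ROUGE-N recall; the official toolkit additionally handles stemming, stopword removal, multiple references, and the skip-bigram variant ROUGE-SU4. The names are hypothetical, not from the thesis or the ROUGE package.

```python
from collections import Counter

def ngrams(tokens, n):
    """Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of the reference's n-grams recovered by the candidate,
    with clipped counts, as in ROUGE-N recall."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / sum(ref.values())
```

For example, rouge_n_recall("the cat sat on the mat", "the cat sat there", n=2) recovers two of the reference's three bigrams, giving roughly 0.67.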
Full thesis: pdf

Centre for Search and Information Extraction Lab
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.