Sci-Blogger: A Step Towards Automated Science Journalism

Author: Raghuram Vadapalli
Date: 2019-05-27
Report no: IIIT/TH/2019/47
Advisor: Vasudeva Varma

Abstract

Science journalism is the art of conveying a detailed scientific research paper in a form that non-scientists can understand and appreciate while ensuring that its underlying information is conveyed accurately. It aims to transform jargon-laden scientific articles into a form that a common reader can comprehend while retaining the meaning of the article. It plays a crucial role in making scientific content suitable for consumption by the public at large. Recent advances in Deep Learning research and its applications in natural language processing have paved the way for impressive results in Natural Language Generation. We leverage these advances to explore the possibility of their use in journalism, and science journalism in particular, because comprehending scientific content is a much harder challenge than most other forms of content, such as shorthand, that journalists use while writing articles.

In this work, we introduce the problem of automated science journalism and present ways to automate some parts of the workflow by automatically generating the ‘title’ of a blog version of a scientific paper. We have built a corpus of 87,328 pairs of research papers and their corresponding blogs from two science news aggregators and have used it to build Science-Blogger, a pipeline-based architecture consisting of a two-stage mechanism to generate the blog titles. To demonstrate the models, we built an interactive tool in which a user can provide the abstract and title of a research paper; these are processed by our APIs to produce a blog title, along with some relevant information about the model used for the generation. Evaluation using standard metrics indicates the viability of the proposed system.

Twitter is another social platform where content of every imaginable category is shared. Naturally, it is also one of the popular media for sharing scientific work, apart from blogs. So, we have also experimented with our model to generate tweets, which are of roughly similar length to blog titles. Evaluation on the generated tweets also showed promising results.

Although significant advances have been made in Natural Language Generation, we noticed that evaluation methods have not caught up with these advances. Popular metrics like ROUGE and BLEU are based on word overlap. None of these metrics ensures that the generated sentences do not contradict the truth (the actual content of the article). It is quite possible for a generated sentence to have high overlap with a human-written sentence while still contradicting it. Misrepresenting facts is considered a serious problem in journalism, so a system automating it should explicitly ensure that such misrepresentations and false statements are not generated. Hence, we also formulated a metric to evaluate sentences generated by such automated abstractive natural language generation systems, called SSAS: Semantic Similarity for Abstractive Summarization. Ideally, a metric evaluating an abstractive system summary should represent the extent to which the system-generated summary approximates the semantic inference conceived by the reader using a human-written reference summary. Most previous approaches relied on word or syntactic sub-sequence overlap to evaluate system-generated summaries. Such metrics cannot evaluate a summary at the semantic inference level. SSAS leverages natural language inference and paraphrasing techniques to frame a novel approach to evaluating system summaries at the semantic inference level. It is based on a weighted composition of quantities representing the level of agreement, contradiction, topical neutrality, paraphrasing, and optionally the ROUGE score between a system-generated and a human-written summary.
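The limitation of word-overlap metrics described above, and the idea of a weighted composition, can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the example sentences, the weights, and the hand-set agreement, contradiction, neutrality, and paraphrase values are all assumptions made for illustration. In a real system those values would come from natural language inference and paraphrase models. The overlap function is a simplified ROUGE-1-style F1, not the official ROUGE toolkit.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1: clipped unigram overlap of two sentences."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def weighted_score(features: dict, weights: dict) -> float:
    """Weighted sum of per-summary quantities; the weights are hypothetical."""
    return sum(weights[name] * features[name] for name in weights)


reference = "the new drug reduces the risk of heart disease"
contradiction = "the new drug increases the risk of heart disease"
paraphrase = "heart disease risk drops with the new drug"

# Hypothetical weights: reward agreement and paraphrasing, penalise contradiction.
weights = {"agreement": 1.0, "contradiction": -1.0, "neutrality": 0.0,
           "paraphrase": 0.5, "rouge": 0.5}

# Hand-set placeholder values standing in for NLI and paraphrase model outputs.
contradiction_features = {"agreement": 0.05, "contradiction": 0.90,
                          "neutrality": 0.05, "paraphrase": 0.10,
                          "rouge": unigram_f1(contradiction, reference)}
paraphrase_features = {"agreement": 0.90, "contradiction": 0.03,
                       "neutrality": 0.07, "paraphrase": 0.90,
                       "rouge": unigram_f1(paraphrase, reference)}

score_contradiction = weighted_score(contradiction_features, weights)
score_paraphrase = weighted_score(paraphrase_features, weights)

print(f"overlap, contradiction: {contradiction_features['rouge']:.3f}")
print(f"overlap, paraphrase:    {paraphrase_features['rouge']:.3f}")
print(f"weighted, contradiction: {score_contradiction:.3f}")
print(f"weighted, paraphrase:    {score_paraphrase:.3f}")
```

In this toy setting the contradicting sentence shares more words with the reference than the faithful paraphrase does, so an overlap-only metric ranks it higher, while the weighted composition ranks the paraphrase higher once contradiction is penalised.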

Full thesis: pdf

Centre for Search and Information Extraction Lab

IIIT Hyderabad Publications
