Computational Approaches for Identifying Sensational Soft News

Author: Vijaysaradhi Indurthi
Date: 2020-04-25
Report no: IIIT/TH/2020/25
Advisor: Vasudeva Varma, Manish Gupta

Abstract

The internet and social media have fundamentally transformed the landscape of news consumption. News consumption through social media is increasing, while consumption through print media is declining. Digital journalism is on the rise, thanks to the low barrier to entry, near-zero distribution costs, and the availability of mobile computing, and online news publishers and distributors have proliferated in recent times. As more and more readers consume news from the internet, it is very important for news publishers to engage users on their websites to maximise revenue, a majority of which is generated from advertisements on the page. This has also led to a rise in yellow journalism and tabloid journalism, which emphasize sensational crime stories, gossip columns about celebrities and sports stars, extreme political views and opinions from one perspective, junk food news, and astrology. In fact, aggregating soft news has become a lucrative business compared to aggregating only hard news. Many publishers publish a mix of soft news and hard news to maximise their revenue. Sensationalism is often the most common tactic that news editors use to increase user engagement. Topics and news stories are worded to excite readers and lure them into clicking on the link to the story. This type of reporting can portray the news in a biased manner and may sometimes manipulate the truth of the story. Part of the sensationalism tactic is to report on very insignificant matters and assign undue importance to them.
Reporting on controversial topics, deliberately omitting facts, using exaggerated words to seek attention, blowing trivial events out of proportion, and reporting on content that is insignificant and irrelevant to the audience are some of the tactics used to achieve higher page impressions in online news media. News editors stoop to low journalistic standards to maximize engagement and revenue at the expense of reporting quality. This kind of reporting gives rise to specific categories of news such as clickbait, bizarre news, fake news, hoax news, misleading news, satire and funny news, and hyperpartisan news. Mimics and clones of popular news outlets are also common. Some categories, such as bizarre news and satire and funny news, are entertaining and harmless. However, other categories, fake news in particular, can cause problems ranging from financial loss to disturbances in society and, in extreme cases, loss of human life. Clickbaits, in the guise of providing entertainment, may waste a lot of readers' precious time and ultimately leave them disappointed. Fake health remedies disguised as health titbits can harm readers by compromising their health. Sensational soft news thus has both positive and negative sides. Therefore, it is extremely important to automatically identify, evaluate and appraise the sensational content in news media. In this thesis, we consider the problem of identifying sensational soft news content. Specifically, we study two types of sensational content: bizarre news and clickbaits. We selected these two categories because such news is usually sensationalized to increase reader engagement.
Identifying sensational soft news from the title alone is challenging because titles are usually short and are often written in convoluted ways that require high-order semantic understanding, usually with the support of facts from a knowledge base. We do a thorough study of both types of content and build computational models to categorise content as hard news or soft news (such as bizarre or clickbait) and also to rank its intensity. We selected bizarre news and clickbait as the broad categories because both have the characteristics of sensationalism. We begin with the problem of bizarre news identification. We compile a dataset of bizarre news and normal news. We do a deep analysis of bizarre news and regular news and identify handcrafted features, such as semantic and linguistic features, for bizarre news detection. We also categorize the bizarre news into different categories. We then consider the problem of clickbaits, which are ubiquitous. In this work, we explore multiple techniques to identify clickbaits in online social media. We collect and annotate a huge dataset of clickbaits, many times larger than any existing dataset, and use it in a semi-supervised approach to clickbait classification. We explore various word and sentence embeddings for clickbait identification. We then consider the problem of clickbaits in Spanish. We chose Spanish for two reasons: the popularity of the language, and the ready availability of datasets and tools for the task. We collect a dataset consisting of a decent number of clickbaits and non-clickbaits in Spanish. We fine-tune the latest state-of-the-art transformer language models on the Spanish clickbait dataset to build models that categorise Spanish headlines as clickbait or not.
We then address the problem of predicting the clickbait intensity of a news headline. Not all clickbaits are equally clickbaity: some titles can be rated as highly clickbaity, while others rate low. We participate in a global clickbait intensity prediction challenge, the Clickbait Challenge organized by Bauhaus-Universität Weimar. Our models outperform all the existing state-of-the-art models. We use transformer language models coupled with traditional machine learning regressor models to achieve the best performance in the challenge.

Full thesis: pdf

Centre: Language Technologies Research Centre

IIIT Hyderabad Publications
