IIIT Hyderabad Publications
Predictive Analytics for News Articles Using Wikipedia

Author: Navya Yarrabelly
Date: 2018-07-23
Report no: IIIT/TH/2018/52
Advisor: Kamalakar Karlapalem

Abstract

The news is saturated with statements about future actions, employment, the stock market, economic growth projections, global financial crises, policies, globalization, climate change, technologies, and more. Such future trends can have a significant impact on our lives and businesses. In this work, we consider a set of articles or reports by journalists and others in which they predict or promise something about the future. The problem we address is to determine the credibility of the authors based on whether their predictions turn out to be true. The two specific problems we address in this thesis are automatically extracting the predictions from the articles and annotating them with various prediction attributes, and then determining the truth of these predictions, using Wikipedia as a credible source from which to retrieve relevant facts that can ascertain their validity.

We estimate that a large number of news articles contain references to the future. Such references are detected through the notion of predictive statements (phrases). Distinguishing predictive statements from factual statements in news articles is important for applications such as fact checking, opinion mining, and future trend analysis. In this thesis, we present our approach to automatically extracting future-related information by solving two sub-problems. The first sub-problem is labeling a sentence as predictive or factual. In addition to extracting the predictions, we address the tasks of clausal scope resolution and disembedding linguistically peripheral clauses with respect to the predictive clause in a sentence. To solve these problems, we extract all the clauses of a given sentence and classify each clause as predictive or factual. We then use a machine-learning-based approach to disambiguate the clause labels using the clausal dependency relations, and label the sentence. We present the results obtained by our prediction extraction methods on two datasets collected and annotated from the economics, politics, and sports domains. Our system attained an accuracy of 0.85–0.89 on both datasets.
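The clause-level extraction and labeling step described above can be sketched roughly as follows. This is a minimal illustration only: it assumes spaCy's en_core_web_sm model for dependency parsing, and it substitutes a simple future-cue keyword rule for the thesis's trained machine-learning clause classifier; the dependency labels and cue words below are illustrative assumptions, not the thesis's actual feature set.

# Minimal sketch: split a sentence into clauses via dependency relations,
# label each clause predictive/factual, then aggregate to a sentence label.
# The keyword rule is a hypothetical stand-in for the thesis's ML classifier.
import spacy

nlp = spacy.load("en_core_web_sm")

# Dependency labels that typically head a clause (assumed for illustration).
CLAUSE_DEPS = {"ROOT", "ccomp", "xcomp", "advcl", "relcl", "conj"}
FUTURE_CUES = {"will", "shall", "going", "expected", "projected",
               "predict", "predicts", "forecast", "forecasts", "likely"}

def extract_clauses(sentence):
    """Return the subtree text of each clause-heading verb."""
    doc = nlp(sentence)
    clauses = []
    for token in doc:
        if token.dep_ in CLAUSE_DEPS and token.pos_ in {"VERB", "AUX"}:
            clauses.append(" ".join(t.text for t in token.subtree))
    return clauses or [sentence]

def label_clause(clause):
    """Rule-based stand-in: predictive if the clause contains a future cue."""
    words = {w.lower() for w in clause.split()}
    return "predictive" if words & FUTURE_CUES else "factual"

def label_sentence(sentence):
    """Label a sentence predictive if any of its clauses is predictive."""
    labels = [label_clause(c) for c in extract_clauses(sentence)]
    return "predictive" if "predictive" in labels else "factual"

if __name__ == "__main__":
    print(label_sentence("Analysts predict that the economy will grow by 3% next year."))
    print(label_sentence("The economy grew by 2.9% last year."))

In the thesis, the final sentence label is resolved from the individual clause labels using the dependency relations between clauses rather than the simple "any clause predictive" rule used here.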
In addition to extracting predictive statements, we further attempted to validate the truth of these predictions. We proposed an architecture that integrates various unscaled Natural Language Processing resources to incorporate background knowledge into our modules, and built an end-to-end system for automated prediction validation (APV) that extracts future speculations and predictions from news articles and social media. We proposed heuristics to identify core prediction triplets and a retrieval model to extract facts relevant to a prediction. We demonstrated relation-alignment-based entailment methods for validating the truth of predictions against relevant facts obtained from a credible source (Wikipedia). We evaluated our methods on two datasets for the prediction validation task. Dataset 1: a Rio Olympics dataset, from which we considered 28 news articles from 6 different sources, extracted 97 predictions, and obtained article credibility scores (F-scores) in the range 0.57–0.71. Dataset 2: an Obama Promises dataset, for which we collected 300 campaign promises made by Barack Obama. To make our case, we compared the trustworthiness of a person based on the promises made and the promises kept. On Barack Obama's campaign promises dataset, our system identified 137 out of 300 promises as kept, compared with the manually verified result of 204 out of 300 promises identified as kept by the PolitiFact website.
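As a rough illustration of the validation flow described above (prediction, fact retrieval from Wikipedia, support check, credibility score), the sketch below uses the public MediaWiki search API and a token-overlap heuristic in place of the thesis's triplet extraction and relation-alignment entailment. All function names, thresholds, and example promises here are assumptions made for illustration, not the thesis's implementation.

# Minimal sketch: retrieve candidate facts from Wikipedia for a prediction,
# apply a crude support check, and score credibility as the fraction kept.
import requests

WIKI_API = "https://en.wikipedia.org/w/api.php"

def retrieve_facts(query, limit=3):
    """Fetch candidate fact snippets from Wikipedia's search API."""
    params = {"action": "query", "list": "search", "srsearch": query,
              "srlimit": limit, "format": "json"}
    data = requests.get(WIKI_API, params=params, timeout=10).json()
    # Snippets may contain HTML highlighting; acceptable for this crude check.
    return [hit["snippet"] for hit in data["query"]["search"]]

def supports(prediction, snippet, threshold=0.5):
    """Token-overlap stand-in for relation-alignment entailment."""
    pred_tokens = set(prediction.lower().split())
    snip_tokens = set(snippet.lower().split())
    return len(pred_tokens & snip_tokens) / max(len(pred_tokens), 1) >= threshold

def validate(prediction):
    """Mark a prediction as kept if any retrieved snippet appears to support it."""
    return any(supports(prediction, s) for s in retrieve_facts(prediction))

def credibility(predictions):
    """Fraction of an author's predictions judged to have come true."""
    kept = sum(validate(p) for p in predictions)
    return kept / len(predictions) if predictions else 0.0

if __name__ == "__main__":
    promises = ["The United States will withdraw combat troops from Iraq",
                "The United States will close the Guantanamo Bay detention camp"]
    print(credibility(promises))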
Full thesis: PDF

Centre for Data Engineering