IIIT Hyderabad Publications
Automated Credibility Assessment of Web Page

Author: Shriyansh Agrawal
Date: 2019-07-31
Report no: IIIT/TH/2019/99
Advisor: Raghu Reddy

Abstract

With more than a trillion web pages, there is a plethora of content available for consumption. Search engine queries invariably lead to overwhelming information, parts of it relevant and others irrelevant. Often the information provided can be conflicting, ambiguous, and inconsistent, which can have serious consequences for people who increasingly rely on web sources for information related to security, health, academia, etc. Prior research stresses that traditional Search Engine Optimization techniques tend to focus on making top-ranked results more and more relevant, and mostly depend on the user's personal information and site popularity. Moreover, people often use two divergent terms, credibility and popularity, interchangeably. Credibility, an important quality characteristic of web pages, is questionable in many cases and tends to be non-uniform. Credibility refers to the degree to which a website can be relied upon. Principally, credibility can be thought of as a compass for guiding us safely through a world of uncertainty, risk, and moral hazards. Novice users of search engines do not know where to start and lack sufficient knowledge for finding the best possible results. For a novice user, surface features such as fonts, colour, images, and other layout elements of the web page create the first impression of credibility [19]. For most regular users of the internet, by contrast, content relevance, information source, evolution of content, and other fine-grained features of credibility decide web page usage. In the past, researchers have proposed approaches for credibility assessment and enumerated features influencing the credibility of web pages. Assessment of a few of those features can be automated using existing literature and contemporary knowledge, while others still need human intelligence.
The Web has been expanding since its inception, and with it various kinds of web pages have emerged, categorized by genre, for example: Help, Article, Discussion, etc. Depending on the genre of a web page, the importance of credibility features such as last-modified date and time, grammar, image-to-text ratio, in- and out-links, and other web page features may differ for assessment. Therefore, credibility assessment without factoring in the genre of a web page can lead to incorrect results. We conducted a crowdsourced survey over multiple channels, asking participants to mark the individual importance of web page elements (features) across different web genres on a Likert scale of 1 to 4. The survey results implied that the importance of each feature varies across genres, which supported our argument for genre-aware credibility assessment of a web page. In this work, we propose an automated approach for credibility assessment of a web page in which the genre is also identified, to produce assessment results comparable to those of human experts. We design a framework (called WEBCred) based on our proposed approach which accommodates various individual modules, such as crawling, genre classification, normalization, and scoring, and keeps them independent of each other to facilitate further extensibility. The proposed framework allows the addition of new genres and features, as well as the alteration of weightages, providing flexibility for user intervention. To validate our proposed approach, we developed an open-source tool capable of genre identification along with extraction and normalization of selected feature instance values to calculate a Genre Credibility Score (GCS) for every web page. A few of these features were new, and we defined their extraction methodologies ourselves, as they are not explicit. Our tool is fully automated: it assesses the GCS of a given web page without any human aid.
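As a minimal sketch of how a genre-aware weighted score like GCS might be computed, assuming feature values are already normalized to [0, 1] and each genre carries its own survey-derived weight table. The feature names, genres, and weights below are illustrative assumptions, not the tool's actual values:

```python
# Hypothetical genre-specific weight tables (illustrative values only):
# each genre weights the same features differently, per the survey results.
GENRE_WEIGHTS = {
    "Article":    {"last_modified": 0.9, "grammar": 0.8, "img_text_ratio": 0.4},
    "Discussion": {"last_modified": 0.6, "grammar": 0.5, "img_text_ratio": 0.3},
}

def genre_credibility_score(genre, normalized_features):
    """Weighted average of normalized feature values under the genre's weights."""
    weights = GENRE_WEIGHTS[genre]
    total = sum(w * normalized_features[f] for f, w in weights.items())
    return total / sum(weights.values())

# Example: the same feature values yield different scores per genre.
page = {"last_modified": 0.7, "grammar": 0.9, "img_text_ratio": 0.5}
print(round(genre_credibility_score("Article", page), 3))  # → 0.738
```

Keeping the weight tables separate from the scoring function mirrors the framework's stated design goal: new genres or features can be added, and weightages altered, without touching the scoring logic.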
The source code [9] of the developed tool is available on GitHub for further extension, and the tool is deployed [7] on the web for testing. We carried out extensive experiments to establish the effectiveness of our approach, using an 'Information Security' dataset of 8,550 URLs with 171 features across 7 genres. This dataset was used for the crowdsourced survey, for training the genre classification model, and for normalizing extracted feature instance values. The supervised learning algorithm, Gradient Boosted Decision Tree, classified genres with 88.75% testing accuracy under 10-fold cross-validation, surpassing the current benchmark (about 80%). The GCS calculated from the genres identified by our trained model correlated 69% with the crowdsourced Web Of Trust (WOT) score and 13% with the algorithm-based Alexa ranking for the selected 'Information Security' web pages. As further validation of our trained model and overall approach, we tested the model on separate 'Health' domain web pages, where it correctly classified genres with 82.26% accuracy. The calculated GCS for 'Health' web pages correlated 59% with WOT and 23% with the Alexa ranking. This cross-domain validation suggests that our approach can be adapted to other application domains such as education and human sciences. The correlation between the WOT and Alexa scores themselves is 13% for 'Information Security' and 23% for 'Health' web pages, which accounts for the disparity in the classification and correlation results of the cross-domain experiment. WEBCred uses genre as a selection criterion to obtain the importance of individual features, which is not done by algorithms such as PageRank, whose primary motive is popularity rather than credibility. GCS shows a better correlation with WOT in both domains, confirming that our approach aligns more closely with the human way of assessing web pages.

Full thesis: pdf

Centre for Software Engineering Research Lab
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.