IIIT Hyderabad Publications
Evaluating the Robustness of Deep Learning Models for NLP

Author: Rishabh Maheshwary
Date: 2023-03-28
Report no: IIIT/TH/2023/104
Advisor: Vikram Pudi

Abstract

The significance of deep neural networks (DNNs) has been well established through their success in a variety of tasks. However, recent studies have shown that DNNs are vulnerable to adversarial examples: inputs crafted by adding small perturbations to the original input. Such perturbations are almost imperceptible to humans but deceive DNNs, raising major concerns about their utility in real-world applications. Although existing adversarial attack methods in NLP achieve high success rates against DNNs, they either require detailed information about the target model or its training data, or need a large number of queries to generate attacks. Such attack methods are therefore unrealistic, as they do not reflect the kinds of attacks encountered in the real world, and they are less effective, since attacks that rely on model information and excessive queries can be easily defended against.

In this thesis, we address these drawbacks by proposing two realistic attack settings: the hard-label black-box setting and the limited-query setting. We then propose two novel attack methods that craft plausible and semantically similar adversarial examples in these settings. The first method uses a population-based optimization procedure to craft adversarial examples in the hard-label black-box setting. The second is a query-efficient method that leverages word attention scores and locality sensitive hashing (LSH) to find important words for substitution in the limited-query setting. We benchmark our results on the same search space used by prior attacks to ensure a fair and consistent comparison. To improve the quality of the generated adversarial examples, we propose an alternative method that uses a masked language model to find candidate words for substitution, taking into account both the original word and its surrounding context.

We demonstrate the efficacy of each proposed approach by attacking NLP models for text classification and natural language inference. In addition, we use adversarial examples to evaluate the robustness and generalization of recent math word problem solvers. Our results show that DNNs for these tasks are not robust, as they can be deceived by our proposed attack methods even in highly restricted settings. We conduct human evaluation studies to verify the validity and quality of the generated adversarial examples.

Full thesis: pdf

Centre for Data Engineering
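The first attack mentioned in the abstract operates in the hard-label black-box setting, where the attacker only observes the model's predicted label. Below is a minimal sketch of a population-based search loop under that constraint; the helper names (`predict_label`, `synonym_candidates`), the budget values, and the word-overlap scoring heuristic are illustrative assumptions, not the thesis's actual algorithm.

```python
import random

def hard_label_attack(words, true_label, predict_label, synonym_candidates,
                      pop_size=20, generations=30, seed=0):
    """Sketch of a population-based search for a hard-label black-box attack.

    predict_label(list_of_words) -> int is the only model access (top label only).
    synonym_candidates(word) -> list[str] defines the substitution search space.
    """
    rng = random.Random(seed)

    def perturb(candidate):
        # Replace one randomly chosen word with a random synonym, if any exist.
        new = list(candidate)
        idx = rng.randrange(len(new))
        options = synonym_candidates(new[idx])
        if options:
            new[idx] = rng.choice(options)
        return new

    def num_changes(candidate):
        # Proxy objective: fewer substitutions ~ closer to the original text.
        return sum(a != b for a, b in zip(candidate, words))

    # Initialization: random perturbations that already flip the predicted label.
    population = []
    for _ in range(50 * pop_size):
        if len(population) == pop_size:
            break
        cand = list(words)
        for _ in range(rng.randint(1, max(1, len(words) // 3))):
            cand = perturb(cand)
        if predict_label(cand) != true_label:
            population.append(cand)
    if not population:
        return None  # no adversarial seed found within the query budget

    # Evolution: keep only adversarial candidates, minimizing the number of changes.
    for _ in range(generations):
        population.sort(key=num_changes)
        parents = population[: max(2, pop_size // 2)]
        children = []
        for _ in range(5 * pop_size):
            if len(parents) + len(children) >= pop_size:
                break
            p1, p2 = rng.choice(parents), rng.choice(parents)
            child = [rng.choice(pair) for pair in zip(p1, p2)]  # crossover
            child = perturb(child)                              # mutation
            if predict_label(child) != true_label:
                children.append(child)
        population = parents + children

    return min(population, key=num_changes)
```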
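The second attack reduces the query budget by combining word attention scores with locality sensitive hashing. The snippet below sketches only the LSH ingredient, using random-hyperplane hashing over word embeddings so that similar words fall into the same bucket; the embedding source, the number of hyperplanes, and how buckets would be combined with attention scores are assumptions made for illustration.

```python
import numpy as np

def hyperplane_signatures(embeddings, num_planes=16, seed=0):
    """Map each word embedding to an LSH bucket via random hyperplanes.

    embeddings: dict mapping word -> 1-D numpy vector of a fixed dimension.
    Words whose vectors fall on the same side of every hyperplane share a
    bucket, so one estimate per bucket can stand in for all of its members.
    """
    rng = np.random.default_rng(seed)
    dim = len(next(iter(embeddings.values())))
    planes = rng.standard_normal((num_planes, dim))

    buckets = {}
    for word, vec in embeddings.items():
        # Bit i is 1 when the vector lies on the positive side of plane i.
        signature = tuple((planes @ vec > 0).astype(int))
        buckets.setdefault(signature, []).append(word)
    return buckets

# Example: nearby vectors hash to the same bucket with high probability.
emb = {
    "good":  np.array([0.90, 0.10, 0.00]),
    "great": np.array([0.88, 0.12, 0.02]),
    "bad":   np.array([-0.90, -0.20, 0.10]),
}
for sig, words in hyperplane_signatures(emb, num_planes=8).items():
    print(sig, words)
```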
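The alternative candidate-generation step uses a masked language model to propose substitutions that fit both the original word and its context. A sketch using the Hugging Face transformers library with bert-base-uncased as a stand-in model is shown below; reading the prediction at the target position while leaving the word unmasked is one simple way to retain information about the original word, and is not necessarily the exact procedure used in the thesis.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def candidate_substitutions(words, position, top_k=10):
    """Rank replacement candidates for words[position] with a masked LM.

    The target word is left in place rather than masked, so the predicted
    distribution at its position reflects both the original word and the
    surrounding context.
    """
    inputs = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    # Map the word index to the index of its first sub-word token.
    token_idx = inputs.word_ids(0).index(position)

    with torch.no_grad():
        logits = model(**inputs).logits

    top_ids = logits[0, token_idx].topk(top_k).indices
    candidates = tokenizer.convert_ids_to_tokens(top_ids.tolist())
    # Drop the original word and sub-word pieces from the candidate list.
    return [c for c in candidates if c != words[position] and not c.startswith("##")]

# Example usage: propose context-aware replacements for "terrible".
print(candidate_substitutions("the movie was terrible and boring".split(), position=3))
```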