IIIT Hyderabad Publications
New Frontiers for Machine Unlearning

Author: Shashwat Goel (2019111006)
Date: 2024-05-23
Report no: IIIT/TH/2024/54
Advisor: Ponnurangam Kumaraguru

Abstract

Machine Learning models increasingly face data integrity challenges due to the use of large-scale training datasets drawn from the Internet. We study what model developers can do if they detect that some data was manipulated or incorrect. Such manipulated data can cause adverse effects like vulnerability to backdoored samples, systematic biases, and, in general, reduced accuracy on certain input domains. Machine Unlearning, traditionally studied for handling user data-deletion requests to provide privacy, can address these by allowing post-hoc deletion of affected training data from a learned model. Achieving perfect unlearning is computationally expensive; consequently, prior works have proposed inexact unlearning algorithms to solve this approximately, as well as evaluation methods to test the effectiveness of these algorithms.

In this thesis, we first outline some necessary criteria for evaluation methods and show that no existing evaluation satisfies them all. Then, we design a stronger black-box evaluation method called the Interclass Confusion (IC) test, which adversarially manipulates data during training to detect the insufficiency of unlearning procedures. We also propose two analytically motivated baseline methods, EU-k and CF-k (sketched after the abstract below), which outperform several popular inexact unlearning methods. We demonstrate how adversarial evaluation strategies can help in analyzing various unlearning phenomena, which can guide the development of stronger unlearning algorithms.

Next, we study the practical constraint that model developers may not know all manipulated training samples; often, only a small, representative subset of the affected data is flagged. We formalize "Corrective Machine Unlearning" as the problem of mitigating the impact of data affected by unknown manipulations on a trained model, possibly knowing only a subset of the impacted samples. We demonstrate that corrective unlearning has significantly different requirements from traditional privacy-oriented unlearning. We find that most existing unlearning methods, including the gold standard of retraining from scratch, require most of the manipulated data to be identified for effective corrective unlearning. However, one approach, SSD, achieves limited success in unlearning adverse effects with just a small portion of the manipulated samples, showing the tractability of this setting. We hope our work spurs research towards developing better methods for corrective unlearning.

Finally, we demonstrate the use of unlearning in reducing the risk of Large Language Models assisting malicious users in the creation of bioweapons and cyberattacks. Adaptations of existing state-of-the-art unlearning techniques fail on this task, probably due to the complexities introduced by not having access to the training data that gives rise to such capabilities. We discuss Contrastive Unlearning Tuning (CUT), a Representation Engineering based unlearning method that steers models towards novice behaviour on potentially harmful dual-use knowledge while retaining general model capabilities. We design a probing evaluation which shows that CUT succeeds in removing this knowledge even from the internal layer representations of LLMs.

Overall, this thesis attempts to extend the frontiers of unlearning from user-privacy applications to debiasing, denoising, removing backdoors, and removing harmful dual-use capabilities.
We highlight the shortcomings of privacy-oriented unlearning methods and formulations in achieving these goals. We hope our work offers practitioners a new strategy for handling challenges arising from web-scale training, and a post-training line of defense towards ensuring AI Safety.
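As a rough illustration of the EU-k and CF-k baselines mentioned in the abstract, the sketch below shows the underlying idea in PyTorch: both methods touch only the last k layers of the trained model and use only the retained data, with EU-k reinitialising those layers before retraining and CF-k simply fine-tuning them. This is a minimal sketch under assumed interfaces, not the thesis implementation; the sequential model layout, the `retain_loader`, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of the EU-k / CF-k unlearning baselines (illustrative only).
# EU-k: reinitialise the last k layers and retrain them from scratch on retained data.
# CF-k: fine-tune the last k layers on retained data, relying on catastrophic forgetting.
# In both cases the earlier layers are frozen and the deleted data is never touched.
import copy
from torch import nn, optim

def unlearn_last_k(model: nn.Sequential, retain_loader, k: int,
                   exact: bool, epochs: int = 5, lr: float = 1e-3):
    model = copy.deepcopy(model)                      # leave the original model untouched
    layers = list(model.children())
    trunk = nn.Sequential(*layers[:-k])               # frozen feature extractor
    head = nn.Sequential(*layers[-k:])                # the only part that is updated

    if exact:                                         # EU-k: discard the head's learned weights
        for module in head.modules():
            if hasattr(module, "reset_parameters"):
                module.reset_parameters()

    trunk.eval()
    for p in trunk.parameters():
        p.requires_grad_(False)

    opt = optim.SGD(head.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    head.train()
    for _ in range(epochs):
        for x, y in retain_loader:                    # only retained data is ever used
            opt.zero_grad()
            loss = loss_fn(head(trunk(x)), y)
            loss.backward()
            opt.step()
    return nn.Sequential(trunk, head)

# Hypothetical usage:
#   eu_k_model = unlearn_last_k(trained_model, retain_loader, k=3, exact=True)   # EU-k
#   cf_k_model = unlearn_last_k(trained_model, retain_loader, k=3, exact=False)  # CF-k
```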
Full thesis: pdf

Centre for C2S2-Precog