IIIT Hyderabad Publications
Bridging the Gap between Pre-Training and Fine-Tuning for Improved Text Classification

Author: Tanish Lad
Date: 2023-06-08
Report no: IIIT/TH/2023/50
Advisor: Radhika Mamidi

Abstract

Pre-training language models like BERT and then fine-tuning them on downstream tasks has achieved state-of-the-art results on a wide range of natural language processing tasks. However, pre-training is usually independent of the downstream task, and previous work has shown that pre-training alone may not be sufficient to capture task-specific nuances. We propose a novel way to tailor a pre-trained BERT model to the downstream task via task-specific masking prior to the standard supervised fine-tuning.

Our approach first creates a word list specific to the task at hand: for sentiment analysis, a small set of words expressing positive and negative sentiment; for hate speech detection, a small set of hate words; for humor detection, a small set of humorous words. Next, we use word embeddings to measure the importance of each word to the task with respect to this word list, which we call the word's task score. Based on the task score, we assign each word a probability of masking, i.e. the likelihood that the word is masked during MLM training. We experiment with different masking functions, including a step function, a linear function, and an exponential function, to determine the best way to map a word's task score to its masking probability.

We then train the BERT model on the masked language modeling (MLM) objective with this selective masking strategy: rather than randomly masking 20% of the input tokens, we selectively mask input tokens according to the probability derived from their task scores. Finally, we fine-tune the BERT model on different downstream binary and multi-class classification tasks, such as sentiment analysis, hate speech detection, formality style detection, named entity recognition, and humor detection. Our experiments show that our selective masking strategy outperforms random masking, indicating its effectiveness in adapting the pre-trained BERT model to specific tasks.

Overall, our approach provides a more targeted and effective way of fine-tuning pre-trained language models for specific tasks by incorporating task-specific knowledge between the pre-training and fine-tuning stages. By selectively masking input tokens based on their importance, we are able to better capture the nuances of a particular task, leading to improved performance on various downstream classification tasks.

Full thesis: pdf

Centre for Language Technologies Research Centre
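A minimal sketch of how the described pipeline could look is given below, assuming cosine similarity against the task word list as the scoring rule. The function names (task_score, masking_probability, selective_mask), the base masking rate, and the exact forms of the step, linear, and exponential mappings are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def task_score(word_vec, task_word_vecs):
    """Assumed scoring rule: cosine similarity to the closest word in the task word list."""
    sims = [
        np.dot(word_vec, t) / (np.linalg.norm(word_vec) * np.linalg.norm(t) + 1e-9)
        for t in task_word_vecs
    ]
    return max(sims)

def masking_probability(score, mode="linear", threshold=0.5, base_p=0.2):
    """Map a task score in [0, 1] to a probability of masking the word."""
    if mode == "step":
        # Mask with a fixed probability only above a threshold.
        return base_p if score >= threshold else 0.0
    if mode == "linear":
        # Probability grows linearly with the score.
        return base_p * score
    if mode == "exponential":
        # Probability grows sharply for high-scoring words, normalized to top out at base_p.
        return base_p * (np.exp(score) - 1) / (np.e - 1)
    raise ValueError(f"unknown mode: {mode}")

def selective_mask(tokens, token_vecs, task_word_vecs, mask_token="[MASK]", mode="linear"):
    """Replace tokens with the mask token according to their task-dependent masking probability."""
    masked = []
    for tok, vec in zip(tokens, token_vecs):
        p = masking_probability(task_score(vec, task_word_vecs), mode=mode)
        masked.append(mask_token if np.random.rand() < p else tok)
    return masked
```

In this sketch, base_p plays the role of the overall masking budget that random masking would spread uniformly over all tokens; the selective variant concentrates it on words that score high against the task word list before the MLM stage.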