IIIT Hyderabad Publications
Domain-Specific Pretrained Models For Natural Language Generation

Author: Sahil Manoj Bhatt (2018111002)
Date: 2023-06-24
Report no: IIIT/TH/2023/96
Advisor: Manish Shrivastava

Abstract

Natural Language Generation (NLG) focuses on the automatic generation of natural language text, which should ideally be coherent, fluent, and stylistically appropriate for a given communicative goal and target audience. NLG tasks are varied, ranging from summarization and headline generation to dialogue generation, and are also heavily dependent on the domain under consideration. Recent research has focused on creating domain-specific datasets and developing domain-specific models to make NLP systems better suited to real-world applications. Training models on domain-specific data has been observed to yield significantly better results across domains, whether legal, financial, or biomedical. However, we observe that little work has been done on problems in the tourism domain. The tourism industry is important both for the benefits it brings directly and for its role as a commercial activity that creates demand and growth for many other industries. Currently, no standard benchmark exists for evaluating travel- and tourism-specific data science tasks and models. To address this gap, we propose TOURISMNLG, a benchmark of five natural language generation (NLG) tasks for the tourism domain, and release corresponding datasets with standard train, validation, and test splits. Moreover, as NLG systems diversify across languages, the datasets and models we contribute are also multilingual, which is beneficial for the tourism industry globally. Further, previously proposed data science solutions for tourism problems do not leverage the recent benefits of transfer learning. Thus, in this thesis, we also contribute the first rigorously pretrained mT5 and mBART model checkpoints for the tourism domain.
The models have been pretrained on four tourism-specific datasets covering different aspects of tourism. Using these models, we present initial baseline results on the benchmark tasks, which indicate an improvement in performance compared to the respective models without domain-specific pretraining. Additionally, we consider the problem of summarization for Indian languages, as described in the ILSUM (Indian Language SUMmarization) shared task, which focuses on summarizing content from the news domain in three important Indian languages: Indian English, Hindi, and Gujarati. We evaluate the performance of existing pretrained models on this task and present our results and findings. We also discuss the steps that must be taken to create high-quality summarization datasets for Indian languages. We hope that the contributions of this thesis will promote active research in natural language generation for travel and tourism, as well as in other domain-specific and language-specific tasks and models.

Full thesis: pdf

Centre for Language Technologies Research Centre