IIIT Hyderabad Publications
Data Creation Pipeline for NLP Applications

Author: Pavan Baswani 2021701035
Date: 2024-05-03
Report no: IIIT/TH/2024/52
Advisor: Manish Shrivastava

Abstract

In the rapidly evolving landscape of Natural Language Processing (NLP) applications, there is a critical need for a versatile data creation pipeline that can address the diverse requirements of different tasks. This thesis introduces a Data Creation Pipeline that significantly enhances the efficiency of data creation for a spectrum of NLP applications, including Abstractive Summarization, Question Answering, Paraphrasing, Legal Named Entity Recognition, Headline Classification, Semantic Relatedness, and Machine Translation correction. The pipeline offers a unified and adaptable solution that streamlines the entire data creation process.

The motivation for this pipeline stems from the availability and limitations of existing open-source, task-specific tools. While these tools excel in their designated areas, they lack the flexibility to accommodate a wide range of NLP applications. Our pipeline bridges this gap by offering a solution that ensures quality data collection.

Key contributions of this work include the development of a systematic and extensible data creation pipeline that begins with the scraping and extraction of pertinent information from news articles. This encompasses not only the article text but also metadata such as publish date, author, category, summary, highlights, headline, sub-headline, tags, images, external links, and miscellaneous details. A noteworthy feature is the pipeline's capability to derive pre-annotations from instruction-based models. This approach transforms the annotation task into a correction task, expediting the annotation process while contributing to the iterative improvement of instruction-based models.

The impact of this pipeline on modern data collection methods for NLP applications is profound. By offering a versatile tool that accommodates many tasks, it streamlines the entire data creation process. The iterative model training based on human instructions not only supports the development of state-of-the-art models for specific tasks but also signifies a paradigm shift in the way instruction-based models are refined over time. Language diversity is also a critical aspect of NLP, and our pipeline acknowledges this by supporting a wide range of languages. This inclusivity allows the pipeline to be applied globally, fostering linguistic diversity in NLP research and development.

The uniqueness of the Data Creation Pipeline lies in its adaptability to various NLP tasks, serving as a comprehensive solution for data creation. The iterative improvement guided by human instructions sets it apart from existing pipelines, offering a dynamic and efficient approach to developing high-performance NLP models.

Full thesis: pdf

Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.