IIIT Hyderabad Publications
Advancing Dravidian Language Processing Beyond the Sentence Level

Author: Nikhil E
Date: 2024-06-20
Report no: IIIT/TH/2024/85
Advisor: Radhika Mamidi

Abstract

The rapid advancements in natural language processing have primarily focused on high-resource languages, leaving low-resource languages, such as those in the Dravidian family, underrepresented in terms of research and resources. To bridge this gap and enable more effective processing of Dravidian languages, this thesis explores the utilization of context beyond the sentence level, delving into linguistic structures at the paragraph and document level, and leverages multilingual training. This is done particularly for the tasks of neural machine translation (NMT) and multi-class sentence classification.

CoPara, the first publicly available paragraph-level n-way aligned corpus for Dravidian languages (Kannada, Malayalam, Tamil, Telugu) and English, was created. This dataset fills a critical gap in multilingual resources for low-resource Dravidian languages and enables research on the impact of paragraph-level information on NMT tasks. The corpus contains aligned paragraphs across English and four Dravidian languages, providing a rich resource for studying cross-lingual phenomena and improving machine translation quality. To demonstrate the utility of CoPara, neural machine translation experiments were conducted by fine-tuning a pre-trained multilingual sequence-to-sequence model on the dataset. The results show significant improvements in translation quality for paragraphs across all language pairs over models trained using sentence-level data, highlighting the potential of leveraging paragraph-level information and multilingual training for enhancing machine translation systems in low-resource settings.

An annotated dataset of SEBI legal case files, along with its Indic adaptation, was created. The dataset consists of sentences classified into legally applicable categories and is the first of its kind for Indian legal adjudication orders concerning insider trading. This dataset facilitated the development of a novel system for aspect-based semantic segmentation of legal case files, tailored to different stakeholders like investors, defense lawyers, and adjudicating officers. The system leverages document-level context to improve sentence classification performance, and multilingual training using the Indic adaptation further enhances the results. The results highlight the importance of incorporating context beyond the sentence level and leveraging multilingual training to advance Dravidian language processing.

In summary, the thesis provides a comprehensive exploration of utilizing paragraph and document-level context for NMT and sentence classification tasks. The datasets and methodologies introduced pave the way for future research on long-text processing and context-aware NLP tasks in low-resource languages, with potential implications for other language families and domains.

Full thesis: pdf

Centre for Language Technologies Research Centre