IIIT Hyderabad Publications
Improving Text Accessibility using Text Simplification and Conversation Disentanglement

Author: K V Aditya Srivatsa
Date: 2023-07-08
Report no: IIIT/TH/2023/113
Advisor: Manish Shrivastava

Abstract

Accessibility of text for machines, i.e., for downstream tasks, has been greatly improved by mainstream NLP tasks such as Machine Translation and Summarization. In this dissertation, however, we delve into two lesser-explored research areas: Text Simplification (reducing the linguistic complexity of text) and Conversation Disentanglement (extracting individual conversation threads from multi-party dialogues).

We pursue text simplification by analyzing the claim that control tokens added to the input text can produce simplified versions with precise variations along key text attributes [1]. The effect of control tokens on simplifications was isolated, both individually and in particular combinations, using dedicated models. We show that although control tokens can help mold outputs closer to the target simplifications, they often cause undesired interactions between control tokens and text attributes, as well as among the control tokens themselves. This sometimes results in outputs that are trivial variants of the source text: they adhere to the control-token constraints but do not furnish true simplifications. We offer methods to curb some of these trivial outputs, e.g., masking named entities to shield them from unwanted replacements and using pre-trained input embeddings to reduce data sparsity. Our modifications produce fewer trivial outputs and show better resilience to unseen contexts.

As part of our work on conversation disentanglement, we examine the most widely used expert-annotated resource in the field: the Ubuntu-IRC (IRC) dataset [2]. We question IRC's utility for training generalizable models for other domains that currently lack similar large-scale, high-quality datasets, e.g., meeting transcripts and movie dialogues. We address this concern by analyzing IRC for potential sources of platform-based specificity. We find that the use of direct mentions (explicit references to a message's recipient) is one such instance of platform specificity that occurs abundantly in the dataset. To measure its impact on model performance, we create Ubuntu-IRC-Hard, a variant of IRC without direct mentions. We show that the performance of past approaches (and of our proposed baselines) degrades significantly without direct mentions. An in-depth analysis of models using qualitative metrics based on dialogue properties reveals brittleness and variation among models across these properties. We introduce two methods that leverage these insights to alleviate this brittleness: (i) a weighted-loss formulation that better represents less frequent sub-ranges of dialogue properties in the data, and (ii) an informed-ensembling technique that determines the subset of models best suited to each property sub-range. Both methods surpass SOTA [3] performance and score more consistently across dialogue properties.
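As a rough illustration of the control-token mechanism examined in the first part of the thesis, the Python sketch below prepends ACCESS-style control tokens to a source sentence before it would be passed to a sequence-to-sequence simplification model. The attribute names, bucketing step, and example ratios are illustrative assumptions, not the thesis's exact configuration.

    # Illustrative sketch: prepending control tokens that encode target text
    # attributes to the source sentence of a simplification model.
    # Attribute names and values below are assumptions, not the thesis setup.

    def bucket(value, step=0.05):
        """Round a ratio to the nearest bucket so tokens stay in a small vocabulary."""
        return round(round(value / step) * step, 2)

    def add_control_tokens(source, char_ratio, lev_sim, tree_depth_ratio):
        """Prefix the source text with control tokens for the desired attributes."""
        tokens = [
            f"<NbChars_{bucket(char_ratio)}>",            # target length relative to source
            f"<LevSim_{bucket(lev_sim)}>",                # character-level similarity to source
            f"<DepTreeDepth_{bucket(tree_depth_ratio)}>", # syntactic-complexity ratio
        ]
        return " ".join(tokens) + " " + source

    print(add_control_tokens(
        "The committee deliberated extensively before reaching a verdict.",
        char_ratio=0.8, lev_sim=0.6, tree_depth_ratio=0.7,
    ))
    # -> <NbChars_0.8> <LevSim_0.6> <DepTreeDepth_0.7> The committee deliberated ...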
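Likewise, a minimal sketch in the spirit of the weighted-loss formulation, method (i): training examples that fall in rarer sub-ranges (buckets) of a dialogue property receive larger loss weights so the most frequent sub-range does not dominate training. The choice of property, the bucket edges, the inverse-frequency weighting, the use of PyTorch, and the link-prediction loss are all assumptions made for illustration.

    # Illustrative sketch of a weighted loss over dialogue-property buckets.
    import torch
    import torch.nn.functional as F
    from collections import Counter

    def bucket_weights(property_values, bucket_edges):
        """Map each value to a bucket and weight buckets by inverse frequency."""
        buckets = [sum(v > e for e in bucket_edges) for v in property_values]
        counts = Counter(buckets)
        total = len(buckets)
        # Rare buckets get weights > 1, frequent buckets get weights < 1.
        weights = [total / (len(counts) * counts[b]) for b in buckets]
        return torch.tensor(weights, dtype=torch.float)

    def weighted_link_loss(logits, targets, weights):
        """Cross-entropy over candidate parent messages, reweighted per example."""
        per_example = F.cross_entropy(logits, targets, reduction="none")
        return (weights * per_example).mean()

    # Example: distance (in messages) between a reply and its parent, bucketed
    # at edges [5, 20]; long-range links are rare and therefore upweighted.
    w = bucket_weights([1, 2, 3, 50, 2, 18], bucket_edges=[5, 20])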
Full thesis: pdf
Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.