IIIT Hyderabad Publications
Study of Argument Structure in Parliamentary Debates

Author: Venkata Krishna Rohit Sakala
Date: 2019-04-26
Report no: IIIT/TH/2019/30
Advisor: Navjyoti Singh, Manish Shrivastava

Abstract

The last decade has witnessed the digitization of much governmental data. With the increasing use of the internet, large volumes of data, including parliamentary proceedings, are being digitized. A major downside of this digitization is that the data is in an unstructured format. The data needs to be structured before it can be used in machine learning, data mining and NLP. Much work has been done on various aspects of parliamentary data such as Hansard and American congressional floor-debate data, but there is not enough work on its argumentation aspects using computational methods. Text summarization of political discourse, particularly of parliamentary proceedings, is a relatively less explored area of research. Argumentation is also a relatively new area of research that has many potential applications and can be used as a feature in machine learning models.

In this thesis, we work on these rough edges: creating structured data of parliamentary debates, creating an annotated dataset of arguments on the debates, and proposing a novel approach for summarizing these debates using argumentation theory. Creating structured data from unstructured PDFs was a challenge. We developed a software parser that leverages the patterns associated with each debate type to convert the debates into a structured format and store them in a database. The resulting dataset consists of 768 debates from the Lok Sabha of the Indian Parliament, spanning 189 sessions. We also discuss potential use cases of the dataset, such as experimenting with natural language processing techniques like stance classification, sentiment analysis and detection of blame and appreciation speeches, and generating statistical analyses of parties, their members and the parliamentary debates.
We demonstrate the significance of the dataset by performing a few preliminary experiments: stance classification, calculation of sentiment polarity, and detection of speeches that raise an Issue, Blame, Appreciate or Call for Action. We achieved accuracy scores of 0.80, 0.74, 0.60, 0.84 and 0.80 for stance classification and for detection of issue, blame, appreciate and call for action speeches, respectively.

In the next part of the thesis, we harness this dataset to create an argument-annotated dataset. The argumentative nature of parliamentary debates makes the dataset well suited to this purpose. We annotate the debates following an argumentation scheme that contains two argument components, 'claim' and 'premise', and two argument relations, 'support' and 'attack', which connect these components. A few researchers have developed argument datasets of essays, news articles and online debate data, but not of parliamentary proceedings. Annotating arguments is a non-trivial task because of the complexity of the theory: identifying argument components and the relations between them needs critical attention and a complete understanding of the theory. Before training the annotators, we achieve an average inter-annotator agreement score of 0.31 using the multi-π metric. After providing guidelines and training the annotators, we achieve an average score of 0.73 using the same metric, which is a significant improvement and shows that arguments can be reliably annotated in parliamentary proceedings.

To show the potential of argumentation theory and a novel application of it, we created a pipeline-based approach for generating extractive summaries of the parliamentary debates. We use argumentation as a feature and design a semi-automatic pipeline for generating debate summaries.
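The multi-π agreement scores reported above can be illustrated with a minimal sketch. It assumes the common formulation in which multi-π coincides with Fleiss' kappa: observed agreement is the proportion of agreeing annotator pairs per item, and expected agreement is computed from the label distribution pooled over all annotators. The function name `multi_pi`, the input layout and the toy labels are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def multi_pi(annotations):
    """Multi-pi (Fleiss-style) agreement for a list of items.

    `annotations` holds one list per item, with one label per annotator.
    Every item must be labelled by the same number of annotators (>= 2).
    """
    if not annotations:
        raise ValueError("need at least one annotated item")
    n_coders = len(annotations[0])
    if n_coders < 2 or any(len(item) != n_coders for item in annotations):
        raise ValueError("every item needs the same number (>= 2) of labels")

    pooled = Counter()   # label counts pooled over all annotators and items
    observed = 0.0       # running sum of per-item pairwise agreement
    pairs_per_item = n_coders * (n_coders - 1)
    for labels in annotations:
        counts = Counter(labels)
        pooled.update(counts)
        observed += sum(n * (n - 1) for n in counts.values()) / pairs_per_item
    observed /= len(annotations)

    total = len(annotations) * n_coders
    expected = sum((n / total) ** 2 for n in pooled.values())
    if expected == 1.0:
        # Degenerate case: only one label was ever used.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example (made-up labels): three annotators label four text spans.
toy = [
    ["claim", "claim", "claim"],
    ["premise", "premise", "premise"],
    ["claim", "premise", "claim"],
    ["none", "none", "none"],
]
print(round(multi_pi(toy), 3))
```

A value near 1 indicates agreement well above chance and a value near 0 indicates chance-level agreement, which is how the rise from 0.31 to 0.73 after annotator training should be read.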
The proposed approach considers topic relevance, argumentative nature, sentiment and context features when generating summaries. We test our approach on the argument-annotated dataset of debates developed earlier, and we discuss and analyze the results obtained with the pipeline. We also compare our approach with other popular, high-performing summarization algorithms such as TextRank, Nenkova and LexRank. For 1500-word summaries, our proposed methodology achieves F-scores of 0.44, 0.21 and 0.41 on the ROUGE-1, ROUGE-2 and ROUGE-L evaluation metrics, respectively.

Full thesis: pdf
Centre for Exact Humanities
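The ROUGE F-scores mentioned in the abstract can be illustrated with a simplified sketch. It assumes lowercased whitespace tokenization with no stemming or stopword removal, so it will not reproduce the scores of the official ROUGE toolkit or of the thesis. The function names and the example sentences are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f_score(overlap, candidate_total, reference_total):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if overlap == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """ROUGE-N F-score from clipped n-gram overlap."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum count
    return f_score(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F-score based on the longest common subsequence."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f_score(lcs_length(cand, ref), len(cand), len(ref))


# Made-up example: a system summary scored against a reference summary.
system = "the house passed the bill"
gold = "the house debated the bill"
print(rouge_n(system, gold, 1), rouge_n(system, gold, 2), rouge_l(system, gold))
```

ROUGE-1 and ROUGE-2 reward shared words and word pairs, while ROUGE-L rewards the longest in-order match, which is why the three reported scores differ for the same summaries.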
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.