Extractive Text Summarisation in Hindi

Author: Sakshee Vijay
Date: 2019-04-12
Report no: IIIT/TH/2019/28
Advisor: Dipti Misra Sharma

Abstract

With an immense and growing amount of Hindi data on the web, a text summariser is helpful for condensing government data, medical reports, news, research articles, and any other digitised documents. The volume of available text has grown beyond the reading capacity of a single user: a single search term on the web returns thousands of relevant pages. Recent improvements in search engines yield better-ranked documents that aim to provide precise information. However, these documents are themselves too long to be read in a few minutes. Hence, the availability of summarisation tools in this domain is of utmost importance.

According to Wikipedia, Hindi is the fourth most-spoken first language in the world. It is written in the Devanagari script, is growing digitally, and serves nearly a billion people residing in India. Summarising Hindi news articles is therefore essential, yet not enough work has been done so far on summarising them automatically. Although the dominant form of news consumption in India is still the newspaper, digital news consumption is becoming increasingly popular. With the world moving towards virtual reality and digitalisation through smartphones, there is a need to summarise data to make it more readable and concise. Summaries provide an efficient way to deal with the growth of data, and a news summariser can be built to manage that growth on a large scale. A gradual move towards summarisation of digitally available Hindi news would allow a larger population to read news efficiently.

We aimed to analyse the following approaches to text summarisation:
1. Frequency-based approach
2. Graph-based approach
3. Feature-based approach

We developed an end-to-end extractive text summariser for news articles and tested it on a dataset of about 24 thousand articles. We have achieved excellent results so far.
The extractive dataset of 200 article-summary pairs developed as part of this thesis should be of great use to the research community, since there is currently no public dataset for extractive summarisation in Hindi. To address this, we collected a dataset of 24,253 news articles. We evaluated the extractive summaries on various parameters against manual gold abstractive summaries of exactly 60 words each. Because the gold summaries that came with the collected articles were abstractive while our summariser is extractive, evaluation also required article-summary pairs whose summaries were of the same type. We therefore manually created a gold dataset of extractive summaries for 200 articles, so that the summariser could be tested under matching conditions.
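As an illustration of the first approach named in the abstract, the following is a minimal sketch of a frequency-based extractive summariser. It is not the implementation from the thesis: the stop-word list, the scoring formula (average non-stop-word frequency per sentence), the greedy selection, and all function names are assumptions made for this example. The default 60-word budget simply mirrors the length of the gold summaries mentioned above.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the thesis).
STOP_WORDS = {"है", "हैं", "था", "का", "की", "के", "में", "और", "से", "को", "पर"}

PUNCTUATION = ",;:\"'()-"


def split_sentences(text):
    """Split text on the Devanagari danda and on '?' and '!'."""
    return [part.strip() for part in re.split(r"[।?!]", text) if part.strip()]


def tokenize(sentence):
    """Split on whitespace and trim surrounding punctuation from each token."""
    # Whitespace splitting keeps Devanagari vowel signs attached to their words.
    tokens = (token.strip(PUNCTUATION) for token in sentence.split())
    return [token for token in tokens if token]


def summarise(text, max_words=60):
    """Return the highest-scoring sentences that fit in max_words, in document order."""
    sentences = split_sentences(text)
    tokenized = [tokenize(sentence) for sentence in sentences]

    # Document-level frequency of every non-stop-word token.
    freqs = Counter(t for tokens in tokenized for t in tokens if t not in STOP_WORDS)

    def score(tokens):
        # Average frequency per token, so long sentences are not favoured by length alone.
        if not tokens:
            return 0.0
        return sum(freqs[t] for t in tokens if t not in STOP_WORDS) / len(tokens)

    # Rank sentence indices by descending score; earlier sentences win ties.
    ranked = sorted(range(len(sentences)), key=lambda i: (-score(tokenized[i]), i))

    chosen, used = [], 0
    for i in ranked:
        if used + len(tokenized[i]) <= max_words:
            chosen.append(i)
            used += len(tokenized[i])

    # Restore the original sentence order for readability.
    return [sentences[i] for i in sorted(chosen)]
```

Calling `summarise(article_text)` returns a list of sentences copied verbatim from the article, which is what makes the summary extractive rather than abstractive. A graph-based or feature-based variant would replace only the `score` function and keep the same selection step.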

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
