IIIT Hyderabad Publications
Exploring Cross-lingual Summarization and Machine Translation Quality Estimation

Author: Nisarg Jhaveri
Date: 2018-12-28
Report no: IIIT/TH/2018/97
Advisor: Vasudeva Varma

Abstract

The need for cross-lingual information access is greater than ever now that the Internet is easily accessible. This is especially true in vastly multilingual societies like India. Cross-lingual summarization (CLS) aims to create summaries in a target language from a document or document set given in a different source language. CLS can significantly improve cross-lingual information access for the millions of people across the globe who do not speak or understand the languages that are widely represented on the web, by making the most important information available in the target language in the form of summaries. It can also make documents originally published in local languages quickly accessible to a large audience that does not understand those languages.

Working towards a better cross-lingual summarization system, we first create a flexible, web-based tool, referred to as the workbench, for human editing of cross-lingual summaries. The workbench is used to rapidly generate publishable summaries in a number of Indian languages for news articles originally published in English. It simultaneously collects detailed logs of the editing process at the article, summary and sentence levels. Similar to translation post-editing logs, these logs can be used to evaluate automated cross-lingual summaries in terms of the effort needed to make them publishable. We use the workbench to generate two manually edited datasets for different tasks.

We observe that the quality of automatic translation is a major bottleneck when working on CLS. Translation Quality Estimation (QE) aims to estimate the quality of machine translation (MT) output without any human intervention or reference translation.
With the increasing use of MT systems in various cross-lingual applications, the need for and applicability of QE systems are growing. We study existing approaches and propose multiple neural network approaches for sentence-level QE, with a focus on MT outputs in Indian languages. For this, we also introduce five new datasets for four language pairs: two for English–Gujarati, and one each for English–Hindi, English–Telugu and English–Bengali. These include one manually post-edited dataset for English–Gujarati created using the workbench. We compare the results of our proposed models with multiple existing state-of-the-art systems, including the winning system in the WMT17 shared task on QE. We show that our proposed neural model, which combines the discriminative power of carefully chosen features with Siamese Convolutional Neural Networks (CNNs), performs significantly better on all Indian language datasets.

We then integrate our work on QE with cross-lingual summarization to study its effect on CLS. We extend a popular mono-lingual summarization method to work with CLS, and add a new objective function that takes QE scores into account while ranking sentences for summarization. We experiment with a number of existing CLS methods under different parameters and settings and present a comparative analysis.

Finally, we publish an end-to-end CLS software package called clstk to make CLS accessible to a larger audience. Besides implementing a number of methods proposed by different CLS researchers over the years, the tool-kit also includes bootstrap code for easy implementation of, and experimentation with, new CLS methods. We hope that this highly modular tool-kit will help CLS researchers contribute more effectively to the area and help developers easily use existing methods in end-user applications.

Full thesis: pdf

Centre for Search and Information Extraction Lab
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.