IIIT Hyderabad Publications
Enabling Code-Mixed Translation: Parallel Corpus Creation and MT Augmentation Approach

Authors: Mrinal Dhar, Vaibhav Kumar, Manish Shrivastava
Conference: Linguistic Resources for NLP Workshop (27th International Conference on Computational Linguistics) (LR4NLP 2018 (COLING-2018) 2018)
Location: Santa Fe, New Mexico, USA
Date: 2018-08-20
Report no: IIIT/TR/2018/73

Abstract

Code-mixing, use of two or more languages in a single sentence, is ubiquitous; generated by multi-lingual speakers across the world. The phenomenon presents itself prominently in social media discourse. Consequently, there is a growing need for translating code-mixed hybrid language into standard languages. However, due to the lack of gold parallel data, existing machine translation systems fail to properly translate code-mixed text. In an effort to initiate the task of machine translation of code-mixed content, we present a newly created parallel corpus of code-mixed English-Hindi and English. We selected previously available English-Hindi code-mixed data as a starting point for our parallel corpus, and 4 human translators, fluent in both English and Hindi, translated the 6096 code-mixed English-Hindi sentences to English. With the help of the created parallel corpus, we analyzed the structure of English-Hindi code-mixed data and present a technique to augment run-of-the-mill machine translation (MT) approaches that can help achieve superior translations without the need for specially designed translation systems. The augmentation pipeline is presented as a pre-processing step and can be plugged with any existing MT system, which we demonstrate by improving code-mixed translations done by systems like Moses, Google Neural Machine Translation System (NMTS) and Bing Translator.

Full paper: pdf

Centre for Language Technologies Research Centre