IIIT Hyderabad Publications
Towards developing a phonetically balanced code-mixed speech corpus for Hindi-English ASR

Authors: Ayushi Pandey, Brij Mohan Lal, Suryakanth V Gangashetty
Conference: 14th International Conference on Natural Language Processing (ICON-2017)
Location: Jadavpur University, Kolkata, India
Date: 2017-12-18
Report no: IIIT/TR/2017/111

Abstract

This paper presents ongoing work on the design of the first phase of a phonetically balanced code-mixed corpus of Hindi-English speech (PBCM-Phase I). The reference corpus is a large code-mixed (LCM) newspaper corpus, drawn from sections that contain frequent English insertions in Hindi matrix sentences. From the phonetically transcribed corpus, the lowest-frequency triphones are compulsorily included, on the assumption that high-frequency phones will then be included automatically. A high correlation of 0.81 with the representative large corpus has been observed. A small-scale speech corpus of 5.6 hours has been collected from 4 volunteer native Hindi speakers. The recordings were made in a professional recording studio. As a second contribution, this paper also presents a baseline recognition system, using pooled monolingual and code-mixed speech datasets for training and testing.

Full paper: pdf

Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.