Towards developing a phonetically balanced code-mixed speech corpus for Hindi-English ASR

Authors: Ayushi Pandey, Brij Mohan Lal, Suryakanth V Gangashetty
Conference: 14th International Conference on Natural Language Processing (ICON-2017)
Location: Jadavpur University, Kolkata, India
Date: 2017-12-18
Report no: IIIT/TR/2017/111

Abstract

This paper presents ongoing work on the design of the first phase of a phonetically balanced code-mixed corpus of Hindi-English speech (PBCM-Phase I). The reference corpus is a large code-mixed (LCM) newspaper corpus, drawn from sections that contain frequent English insertions in a Hindi matrix sentence. From the phonetically transcribed corpus, inclusion of the lowest-frequency triphones is made compulsory, on the assumption that high-frequency phones will then be included automatically. A high correlation of 0.81 with the representative large corpus is observed. A small-scale speech corpus of 5.6 hours has been collected from 4 volunteer native Hindi speakers, recorded in a professional studio environment. As a second contribution, the paper also presents a baseline recognition system that uses pooled monolingual and code-mixed speech datasets for training and testing.
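
The selection idea in the abstract (cover the rarest triphones first, then check how well the subset tracks the large corpus) can be illustrated with a minimal sketch. This is not the authors' implementation: the greedy cover strategy, the function names `triphones`, `select_rare_first` and `pearson`, the parameter `n_rare`, and the use of Pearson correlation over triphone counts are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

def triphones(phones):
    """Return the overlapping phone triples of one transcribed sentence."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]

def select_rare_first(corpus, n_rare):
    """Greedily pick sentence indices until the n_rare least frequent
    triphones of the corpus are all covered (hypothetical strategy)."""
    counts = Counter(t for sent in corpus for t in triphones(sent))
    # Ascending frequency; ties broken by the triphone itself for determinism.
    rare = sorted(counts, key=lambda t: (counts[t], t))[:n_rare]
    uncovered = set(rare)
    selected = []
    while uncovered:
        # Choose the sentence that covers the most still-uncovered rare triphones.
        best = max(range(len(corpus)),
                   key=lambda i: len(uncovered & set(triphones(corpus[i]))))
        gained = uncovered & set(triphones(corpus[best]))
        if not gained:
            break
        selected.append(best)
        uncovered -= gained
    return selected, counts

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)

# Toy phone sequences (illustrative only, not taken from the paper's corpus).
corpus = [
    ["k", "a", "m", "a", "l"],
    ["k", "a", "m", "r", "a"],
    ["b", "u", "k", "k", "a", "m"],
]
selected, full_counts = select_rare_first(corpus, n_rare=3)
subset_counts = Counter(t for i in selected for t in triphones(corpus[i]))
inventory = sorted(full_counts)
full_vec = [full_counts[t] for t in inventory]
subset_vec = [subset_counts[t] for t in inventory]
```

Under these assumptions, `pearson(full_vec, subset_vec)` would compare the triphone distribution of the selected subset against that of the reference corpus, provided neither vector is constant. How the reported 0.81 was actually computed is not stated in the abstract.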

Full paper: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
