Development of IIITH Hindi English Code Mixed Speech Database

Authors: Rambabu B, Suryakanth V Gangashetty
Conference: 6th International Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU'18)
Location: Gurugram, India
Date: 2018-08-29
Report no: IIIT/TR/2018/72

Abstract

This paper presents the design and development of the IIITH Hindi-English code-mixed (IIITH-HE-CM) text corpus and the corresponding speech corpus. The corpus was collected from several native Hindi speakers from different geographical regions of India. The IIITH-HE-CM corpus contains phonetically balanced code-mixed sentences that cover all the phonemes of Hindi and English. We used the frequency of word-internal triphone sequences, which carry language-specific information that helps in code-mixed speech recognition and language modelling. The code-mixed sentences are written in Devanagari script. Since computers can recognize Roman symbols, we used the Indian Language Speech Sound Label (ILSL) transcription. A single acoustic model is built for the Hindi-English mixed language instead of separate language-dependent models. A large-vocabulary code-mixed speech recognition system is developed based on a deep neural network (DNN) architecture. The proposed code-mixed speech recognition system attains a lower word error rate (WER) than the conventional system.
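
As an illustration of the word-internal triphone frequency mentioned in the abstract, the following is a minimal Python sketch, not the authors' implementation. The `lexicon` mapping, the phone labels, and the example sentences are hypothetical placeholders (they are not actual ILSL labels or corpus sentences); the sketch only assumes that each word maps to a sequence of phone labels and that triphones are not formed across word boundaries.

```python
from collections import Counter


def word_internal_triphones(phones):
    """Return every (left, centre, right) phone triple inside a single word."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]


def triphone_frequencies(sentences, lexicon):
    """Count word-internal triphones over a list of whitespace-tokenised sentences."""
    counts = Counter()
    for sentence in sentences:
        for word in sentence.split():
            phones = lexicon.get(word)
            if phones is None:
                # Words missing from the (hypothetical) lexicon are skipped.
                continue
            counts.update(word_internal_triphones(phones))
    return counts


# Hypothetical pronunciation lexicon with placeholder phone labels.
lexicon = {
    "mera": ["m", "e", "r", "a"],
    "phone": ["f", "o", "n"],
    "kahan": ["k", "a", "h", "a", "n"],
}

# Hypothetical romanised code-mixed sentences, for illustration only.
sentences = ["mera phone kahan", "mera phone"]

counts = triphone_frequencies(sentences, lexicon)
for triphone, freq in counts.most_common():
    print("-".join(triphone), freq)
```

Counts of this kind could, for example, be compared across candidate sentences when checking phonetic balance; how the paper actually uses the frequencies is described in the full text, not here.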

Full paper: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
