IIIT Hyderabad Publications
A Multi Modal Approach to Speech-to-Sign Language Generation

Author: Mounika Kanakanti (2019900003)
Date: 2024-05-25
Report no: IIIT/TH/2024/107
Advisor: Manish Shrivastava

Abstract

Language is a communication system used to share complex thoughts and ideas and is a powerful tool for social cognition. It relies on a multitude of verbal and non-verbal cues to share information. Analyzing the interplay of these cues in individuals with distinct sensory experiences offers a valuable perspective on natural languages, as it reveals how analogous contextual information is conveyed through different modalities. Research in these areas is not only of theoretical interest but may also have important practical implications for building more inclusive solutions.

Sign language is a rich form of communication that conveys meaning through a combination of signs, facial expressions, and body movements. While Natural Language Processing (NLP) has advanced significantly, progress in supporting sign language has been less substantial. Automatic sign language translation and generation systems offer an efficient and accessible way to bridge this gap and facilitate communication between the deaf and hearing communities. Existing research in sign language generation has focused predominantly on text-to-sign pose generation, while speech-to-sign pose generation remains relatively underexplored.

In this work, we propose an architecture that utilises prosodic information from speech audio and semantic context from text to generate sign pose sequences. We adopt a multi-tasking strategy with an additional task of predicting facial expressions in the form of Facial Action Units (FAUs), which capture the intricate facial muscle movements that play a crucial role in conveying specific facial expressions during signing. We train our models on an existing Indian Sign Language dataset containing sign language videos with audio and text translations, and we evaluate them using Dynamic Time Warping (DTW) and Probability of Correct Keypoints (PCK) scores. We find that combining prosody and text as input, along with incorporating facial action unit prediction as an additional task, outperforms previous models in both DTW and PCK scores. We also discuss the challenges and limitations of speech-to-sign pose generation models to encourage future research in this domain.

Full thesis: pdf

Centre for Language Technologies Research Centre
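As a rough illustration of the evaluation metrics named in the abstract, the sketch below computes a PCK score and a DTW alignment cost between predicted and ground-truth pose keypoint sequences. The function names, the bounding-box-diagonal normalisation, the threshold alpha, and the keypoint count are illustrative assumptions, not the exact protocol used in the thesis.

import numpy as np

def pck(pred, gt, alpha=0.2):
    """Probability of Correct Keypoints: fraction of predicted keypoints that
    fall within alpha * a per-frame reference length of the ground truth.
    pred, gt: arrays of shape (frames, keypoints, 2).
    The bounding-box normalisation and alpha=0.2 are assumptions."""
    # Per-frame reference length: diagonal of the ground-truth bounding box.
    ref = np.linalg.norm(gt.max(axis=1) - gt.min(axis=1), axis=-1, keepdims=True)
    dist = np.linalg.norm(pred - gt, axis=-1)  # (frames, keypoints)
    return float((dist <= alpha * ref).mean())

def dtw_cost(pred, gt):
    """Dynamic Time Warping cost between two pose sequences of possibly
    different lengths, using mean per-frame keypoint distance as local cost."""
    n, m = len(pred), len(gt)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(pred[i - 1] - gt[j - 1], axis=-1).mean()
            D[i, j] = cost + min(D[i - 1, j], D[i - 1, j - 1], D[i, j - 1])
    return float(D[n, m])

# Example usage: compare a noisy copy of the ground truth against itself.
gt = np.random.rand(50, 67, 2)                     # 50 frames, 67 keypoints (assumed)
pred = gt + 0.01 * np.random.randn(*gt.shape)
print("PCK:", pck(pred, gt), "DTW:", dtw_cost(pred, gt))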