IIIT Hyderabad Publications
Beyond Text: Expanding Speech Synthesis with Lip-to-Speech and Multi-Modal Fusion

Author: Neha Sahipjohn (2021702012)
Date: 2024-06-20
Report no: IIIT/TH/2024/108
Advisor: Vineet Gandhi

Abstract

Speech constitutes a fundamental aspect of human communication. Therefore, the ability of computers to synthesize speech is paramount for achieving more natural human-computer interaction and increased accessibility, particularly for individuals with reading limitations. Recent advances in AI and machine learning, alongside generative AI techniques, have significantly improved speech synthesis quality. Text is a common input modality for speech synthesis, and Text-to-Speech (TTS) systems have achieved notable milestones in intelligibility and naturalness. In this thesis, we propose a system that synthesizes speech directly from lip movements and explore the idea of a unified speech synthesis model that can synthesize speech from different modalities: text-only, video-only, or combined text and video inputs. This facilitates applications in dubbing and accessibility initiatives aimed at giving voice to individuals who are unable to vocalize, and it promises streamlined communication in noisy environments as well.

We propose a novel lip-to-speech synthesis system that achieves state-of-the-art performance by leveraging advances in self-supervised learning and sequence-to-sequence networks, enabling the generation of highly intelligible and natural-sounding speech even with limited data. Existing lip-to-speech systems primarily focus on directly synthesizing speech or mel-spectrograms from lip movements, which often compromises intelligibility and naturalness because speech content is entangled with ambient information and speaker characteristics. We propose a modularized approach that uses representations which disentangle speech content from speaker characteristics, leading to superior performance. Our work also sheds light on the information-rich nature of embedding spaces compared to tokenized representations. The system maps lip-movement representations to disentangled speech representations, which are then fed into a vocoder for speech generation. Recognizing the potential applications in dubbing and the importance of synthesizing accurate speech, we explore a multimodal input setting that incorporates text alongside lip movements.

Through extensive experimentation and evaluation across various datasets and metrics, we demonstrate the superior performance of the proposed method. Our approach achieves high correctness and intelligibility, paving the way for practical deployment in real-world scenarios. This work contributes to advancing the field of lip-to-speech synthesis, offering a robust and versatile solution for generating natural-sounding speech from silent videos, with broader implications for accessibility, human-computer interaction, and communication technology.

Full thesis: pdf

Centre for Visual Information Technology
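The abstract describes a modular pipeline: lip-movement features are mapped by a sequence-to-sequence network onto disentangled speech-content representations, which a vocoder then turns into audio. The sketch below is a minimal illustration of how such a pipeline could be wired in PyTorch; the module choices (GRU encoder, Transformer mapper), dimensions, and names are assumptions for illustration, not the thesis's actual architecture.

import torch
import torch.nn as nn


class LipToSpeechPipeline(nn.Module):
    """Maps lip-movement features to speech-content embeddings (sketch only)."""

    def __init__(self, video_dim=512, speech_dim=768, hidden_dim=512):
        super().__init__()
        # Stage 1: encode per-frame lip-region features (e.g. from a visual front end).
        self.video_encoder = nn.GRU(video_dim, hidden_dim, batch_first=True,
                                    bidirectional=True)
        # Stage 2: sequence-to-sequence mapping onto disentangled speech-content
        # embeddings (e.g. features from a self-supervised speech model), which
        # carry content but not speaker identity or ambient information.
        self.seq2seq = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=2 * hidden_dim, nhead=8,
                                       batch_first=True),
            num_layers=4,
        )
        self.to_speech_repr = nn.Linear(2 * hidden_dim, speech_dim)

    def forward(self, lip_feats):
        # lip_feats: (batch, frames, video_dim)
        enc, _ = self.video_encoder(lip_feats)
        enc = self.seq2seq(enc)
        return self.to_speech_repr(enc)  # (batch, frames, speech_dim)


# Usage: the predicted speech-content embeddings would then be passed to a
# separately trained vocoder to synthesize the waveform (omitted here).
model = LipToSpeechPipeline()
dummy_video = torch.randn(2, 75, 512)  # 2 clips, 75 frames of lip features each
speech_repr = model(dummy_video)
print(speech_repr.shape)               # torch.Size([2, 75, 768])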