AV Speech Separation
30 December 2024


Overview
Audio-Visual Speech Separation and Personalized Keyphrase Detection in Noisy Environments is a project inspired by the human brain's ability to focus on a single voice amid overlapping conversations, known as the "cocktail party effect". It targets scenarios such as conferences, public events, and crowded spaces, where audio-only processing falls short. By leveraging audio-visual cues, the system combines facial movement detection, chiefly lip movements, with audio filtering to suppress background noise and improve transcription accuracy. It also keeps each separated audio stream synchronized with the corresponding speaker on screen, making it applicable in domains like security, media production, and assistive technologies.
Key Features
- Accurate Audio Separation: The system isolates individual audio streams from multiple speakers in a video, ensuring clear separation of voices even in overlapping conversations.
- Audio-Visual Synchronization: It maps separated audio tracks to the corresponding lip movements of each speaker, maintaining a precise relationship between audio and visual components.
- Enhanced Captioning: Improves automatic captioning systems by generating individual captions for each speaker, even in complex multi-speaker scenarios.
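To make the audio-visual idea concrete, here is a minimal, purely illustrative sketch (not the project's actual model) of the core intuition: frames of a mixture spectrogram can be gated by a visual voice-activity signal derived from the target speaker's lip motion, so that frames where the speaker's lips are still are suppressed. All data here is synthetic, and the sigmoid gating is an assumption for illustration.

```python
import numpy as np

# Hypothetical illustration: gate a mixture spectrogram with a visual
# voice-activity signal derived from lip motion. Not the paper's model.

rng = np.random.default_rng(0)

n_frames, n_freq = 100, 64
mixture = rng.random((n_frames, n_freq))   # |STFT| of the noisy mixture (synthetic)

# Per-frame lip-motion energy for the target speaker (e.g. mean landmark
# displacement between consecutive video frames); synthetic here.
lip_motion = np.zeros(n_frames)
lip_motion[20:60] = rng.random(40)         # speaker talks in frames 20-59

# Soft visual gate in (0, 1): sigmoid of normalized lip activity.
z = (lip_motion - lip_motion.mean()) / (lip_motion.std() + 1e-8)
gate = 1.0 / (1.0 + np.exp(-4.0 * z))

# Apply the gate frame-wise: frames without target lip activity are attenuated.
separated = mixture * gate[:, None]

print(separated.shape)  # (100, 64)
```

A real system would replace the hand-crafted gate with a learned fusion network operating on lip embeddings and complex spectrograms, but the frame-level coupling between visual activity and the estimated mask is the same basic mechanism.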
My teammates, Rishab R Budale, Tejas Nayak B, and Hithaish, and I presented our paper at the 3rd Congress on Control, Robotics, and Mechatronics (CRM2025), organized by SR University, Warangal, India, on February 2, 2025. This work was carried out under the guidance of Dr. Priya R Kamath.