Lip Reading for Singing: Audio-Visual Approaches to Singing Voice Separation
Author(s)
Ma, Teng
Abstract
Singing voice separation is a challenging audio source separation task that has been approached predominantly through audio-only methods. This thesis investigates whether visual information derived from audio-visual speech embeddings can improve separation performance, and examines how training data composition shapes the contribution of visual conditioning and its generalization across datasets. We use AV-HuBERT audio-visual speech embeddings to condition an OpenUnmix-based audio backbone through feature-wise linear modulation (FiLM). A systematic evaluation is conducted across five conditioning configurations, three training sets, and three test sets.
Results show that visual conditioning consistently outperforms a fine-tuned audio-only baseline, with the "everywhere" configuration achieving the strongest performance across all experimental conditions. Gains are most reliable when models are trained on combined natural and simulated data, which is the only setting under which visual conditioning yields statistically significant improvements on natural data. Models trained exclusively on natural data exhibit poor out-of-distribution generalization, whereas simulated training data, enriched through random stem mixing, promotes greater robustness. These findings indicate that both conditioning architecture and training data distribution are critical factors in determining how effectively visual information can be exploited for singing voice separation.
Nevertheless, the improvements are modest and accompanied by notable limitations. The use of speech-pretrained visual embeddings introduces a domain gap that is particularly evident in the embeddings' failure to track melismatic singing. The scarcity of natural audio-visual singing data further constrains both model training and evaluation. The results are best interpreted as a proof of concept: visual cues from a singer's face can benefit separation even when features are derived from a speech-pretrained model, but fully realizing the potential of visual conditioning in this domain will require singing-specific representations and substantially richer data resources.
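For readers unfamiliar with feature-wise linear modulation (FiLM), the following is a minimal sketch of the general technique, not the thesis implementation. It assumes a single conditioning vector (a stand-in for a visual embedding) that is mapped by two hypothetical affine layers to a per-channel scale gamma and shift beta, which are then applied to an audio feature map. All names, shapes, and weights here are illustrative assumptions; the abstract does not specify where or how the modulation is applied in the OpenUnmix-based backbone.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Affine map y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def film(features: Matrix, cond: Vector,
         w_gamma: Matrix, b_gamma: Vector,
         w_beta: Matrix, b_beta: Vector) -> Matrix:
    """Apply FiLM: out[c][t] = gamma[c] * features[c][t] + beta[c].

    features: [channels][time] feature map (assumed layout, for illustration)
    cond:     conditioning vector (stand-in for a visual embedding)
    """
    gamma = linear(cond, w_gamma, b_gamma)  # per-channel scale
    beta = linear(cond, w_beta, b_beta)     # per-channel shift
    if len(gamma) != len(features) or len(beta) != len(features):
        raise ValueError("gamma and beta need one entry per feature channel")
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Toy example: 2 channels, 2 time steps, 2-dimensional conditioning vector.
features = [[1.0, 2.0], [3.0, 4.0]]
cond = [1.0, 0.5]
out = film(
    features, cond,
    w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 0.0],  # gamma = [2.0, 0.5]
    w_beta=[[0.0, 0.0], [2.0, 0.0]], b_beta=[0.5, 0.0],    # beta  = [0.5, 2.0]
)
print(out)
```

In a trained model the weights producing gamma and beta would be learned, and the conditioning would typically vary over time rather than being a single fixed vector as assumed in this sketch.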
Date
2026-05
Resource Type
Text
Resource Subtype
Thesis (Masters Degree)