Incorporating Geometric and Consistency Constraints into Deep Models for Robust Phase Reconstruction and Speech Enhancement

Author(s)
Ku, Pin-Jui
Abstract
Phase reconstruction of short-time Fourier transform (STFT) spectra remains a central challenge in speech generation tasks. Although early studies emphasized magnitude over phase, recent work has shown that accurate phase reconstruction is crucial for reducing the artifacts and distortions that degrade quality in speech enhancement (SE). With the rapid advancement of deep neural networks (DNNs), recent research has aimed to estimate the magnitude and phase of clean spectrograms simultaneously. However, most approaches predict the phase spectrogram directly, a task made difficult by the phase's unstructured nature, wrapping ambiguities, and extreme sensitivity to time shifts. Direct estimation also overlooks alternative phase configurations that can yield perceptually valid speech. This dissertation proposes a new framework for phase estimation that overcomes these limitations and demonstrates its effectiveness across multiple DNN-based SE models.
We begin by introducing the first deep state-space SE model operating on complex-valued spectrograms. Although it surpasses baseline models with a compact U-Net architecture, its estimated phase offers limited improvement over the noisy phase, underscoring the difficulty of direct phase prediction. To address this, we develop a novel explicit consistency-preserving loss that exploits the observation that perceptually high-quality speech arises when magnitude and phase are mutually consistent. Building on this insight, we combine geometric constraints under additive-noise conditions with the consistency principle, yielding the Multi-Sourced Griffin-Lim Algorithm (MSGLA). MSGLA jointly refines speech and noise phases through iterative updates guided by DNN-estimated magnitudes and geometric relationships, outperforming direct phase estimation and prior geometric methods.
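The geometric constraint mentioned above can be made concrete. Under additive noise, each STFT bin satisfies X = S + N, so the three magnitudes form a triangle in the complex plane and the law of cosines determines the speech-noisy phase difference up to a sign ambiguity. The following is an illustrative sketch of that relationship only, not the dissertation's MSGLA implementation; the function name is hypothetical:

```python
import numpy as np

def phase_diff_from_magnitudes(mag_noisy, mag_speech, mag_noise):
    """Law-of-cosines constraint for additive noise X = S + N:
    returns |theta_S - theta_X| per bin (the sign stays ambiguous)."""
    cos_d = (mag_noisy**2 + mag_speech**2 - mag_noise**2) / (
        2.0 * mag_noisy * mag_speech
    )
    # Clip guards against tiny numerical overshoot outside [-1, 1].
    return np.arccos(np.clip(cos_d, -1.0, 1.0))
```

Because only the absolute phase difference is recoverable from magnitudes, the sign of the offset is exactly the kind of residual ambiguity that an iterative, consistency-guided refinement must resolve.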
Finally, we extend these ideas to a large-scale generative pretraining framework that models the distribution of clean speech spectrograms and incorporates the consistency-based phase loss during training.
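The consistency principle underlying the loss and MSGLA is the one exploited by the classical Griffin-Lim algorithm: a magnitude/phase pair is consistent when it survives an inverse-STFT/STFT round trip. A minimal sketch of that classical single-source iteration (not the multi-source variant proposed here; window and hop parameters are illustrative) using SciPy:

```python
import numpy as np
from scipy.signal import istft, stft

def griffin_lim(target_mag, n_iter=32, nperseg=256, noverlap=192, seed=0):
    """Classic Griffin-Lim: alternate between (a) projecting onto the set of
    consistent spectrograms via an iSTFT/STFT round trip and (b) restoring
    the trusted target magnitude."""
    rng = np.random.default_rng(seed)
    # Start from a random phase; only the magnitude is trusted.
    spec = target_mag * np.exp(1j * rng.uniform(-np.pi, np.pi, target_mag.shape))
    for _ in range(n_iter):
        _, x = istft(spec, nperseg=nperseg, noverlap=noverlap)
        _, _, spec = stft(x, nperseg=nperseg, noverlap=noverlap)
        spec = target_mag * np.exp(1j * np.angle(spec))
    _, x = istft(spec, nperseg=nperseg, noverlap=noverlap)
    return x
```

MSGLA, as described in the abstract, departs from this template by refining speech and noise phases jointly and by constraining each update with the additive-noise geometry rather than the magnitude projection alone.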
Date
2025-12
Resource Type
Text
Resource Subtype
Dissertation (PhD)