Real-Time Speaker Diarization with Dynamic Enrollment Using External Knowledge
Author(s)
Sohn, Chanwoo
Abstract
Real-time speaker diarization with dynamic enrollment faces a fundamental challenge: building reliable speaker models from limited data while maintaining
consistent speaker identities throughout recordings. Existing systems suffer
from two critical limitations: covariance instability causing speaker
fragmentation, and threshold sensitivity preventing universal deployment.
This thesis introduces two complementary contributions to address these
limitations. First, acoustic matching against VoxCeleb1 stabilizes speaker
covariance estimation: instead of estimating a covariance from limited
enrollment data, the system uses log-likelihood scoring to identify the
reference speaker whose covariance structure best fits the target speaker's
observed data. Second, a
scale-invariant, outlier-robust detection mechanism replaces the
summation-based distance metric with trimmed mean Mahalanobis distance
computation, enabling universal threshold usage across different covariance
structures and system configurations.
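The two mechanisms described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the use of the enrollment sample mean, the Gaussian likelihood form, the per-dimension trimming, and the `trim` fraction are all assumptions made for illustration only.

```python
import numpy as np

def gaussian_loglik(x, mean, cov):
    """Total Gaussian log-likelihood of the rows of x under N(mean, cov)."""
    d = x.shape[1]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    # Solve a linear system rather than inverting the covariance explicitly.
    maha_sq = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return float(-0.5 * np.sum(maha_sq + logdet + d * np.log(2 * np.pi)))

def select_reference_cov(x, ref_covs):
    """Index of the reference covariance that scores x highest (hypothetical helper)."""
    mean = x.mean(axis=0)  # assumption: centre on the enrollment sample mean
    scores = [gaussian_loglik(x, mean, c) for c in ref_covs]
    return int(np.argmax(scores))

def trimmed_mean_mahalanobis(x, mean, cov, trim=0.1):
    """Trimmed mean of squared whitened components for one embedding x.

    The sum of z**2 is the squared Mahalanobis distance; here the smallest
    and largest `trim` fraction of components are dropped and the rest are
    averaged, so the value does not grow with dimensionality (assumed
    interpretation of the trimmed-mean metric).
    """
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, x - mean)  # whitened deviation
    sq = np.sort(z ** 2)
    k = int(trim * sq.size)
    return float(sq[k:sq.size - k].mean())
```

Under these assumptions, averaging instead of summing keeps the score on a comparable scale across embedding dimensions, and trimming limits the influence of a few extreme components, which is one way a single threshold could be reused across configurations.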
Experimental evaluation on the VoxConverse dataset shows that the proposed
method reduces mean absolute speaker counting error by 87.9% compared to the
baseline, with the estimated speaker count falling within ±2 of ground truth
for 75% of recordings. These results establish that principled integration
of external knowledge with robust statistical methods can significantly enhance
real-time speaker diarization performance.
Date
2026-05
Resource Type
Text
Resource Subtype
Thesis (Masters Degree)