Real-Time Speaker Diarization with Dynamic Enrollment Using External Knowledge

Author(s)
Sohn, Chanwoo
Abstract
Real-time speaker diarization with dynamic enrollment faces a fundamental challenge: building reliable speaker models from limited data while maintaining consistent speaker identities throughout recordings. Existing systems suffer from two critical limitations: covariance instability causing speaker fragmentation, and threshold sensitivity preventing universal deployment. This thesis introduces two complementary contributions to address these limitations. First, VoxCeleb1 acoustic matching stabilizes speaker covariance estimation by identifying the reference speaker whose covariance structure best fits the target speaker's observed data through log-likelihood scoring, rather than estimating covariances from limited enrollment data. Second, a scale-invariant, outlier-robust detection mechanism replaces the summation-based distance metric with trimmed mean Mahalanobis distance computation, enabling universal threshold usage across different covariance structures and system configurations. Experimental evaluation on the VoxConverse dataset demonstrates that the proposed method achieves an 87.9% reduction in mean absolute speaker counting error compared to the baseline, with 75% of recordings falling within ±2 speakers of ground truth. These results establish that principled integration of external knowledge with robust statistical methods can significantly enhance real-time speaker diarization performance.
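The two ideas in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes each speaker is modelled as a single Gaussian centred on the mean of the observed frames, that candidate reference covariances (for example, from VoxCeleb1 speakers) are already available as matrices, and that the detection distance uses a diagonal covariance so that it splits into per-dimension terms. The function names, the `trim` parameter, and the toy data are hypothetical.

```python
import numpy as np


def select_reference_covariance(frames, reference_covs):
    """Return the index of the reference covariance that gives the highest
    Gaussian log-likelihood for the observed frames.

    Assumption: one full-covariance Gaussian per speaker, centred on the
    mean of the observed frames.
    """
    centered = frames - frames.mean(axis=0)
    n, d = centered.shape
    scores = []
    for cov in reference_covs:
        _, logdet = np.linalg.slogdet(cov)
        # Squared Mahalanobis distance of every frame under this covariance.
        maha = np.einsum("ij,ij->i", centered @ np.linalg.inv(cov), centered)
        scores.append(-0.5 * (maha.sum() + n * (logdet + d * np.log(2 * np.pi))))
    return int(np.argmax(scores))


def trimmed_mean_mahalanobis(x, mean, var, trim=0.1):
    """Trimmed mean of the per-dimension squared normalised deviations.

    Assumption: diagonal covariance, passed as the variance vector `var`.
    A fraction `trim` of the terms is dropped at each end before averaging.
    """
    terms = np.sort((x - mean) ** 2 / var)
    k = int(trim * terms.size)
    return float(terms[k:terms.size - k].mean())


# Toy data: frames spread widely along the first axis, narrowly along the second.
frames = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 0.5], [0.0, -0.5]])
reference_covs = [np.eye(2), np.diag([4.0, 0.25]), np.diag([0.25, 4.0])]
best = select_reference_covariance(frames, reference_covs)

# One extreme dimension dominates a plain sum but is removed by trimming.
x = np.array([1.0] * 9 + [100.0])
score = trimmed_mean_mahalanobis(x, np.zeros(10), np.ones(10), trim=0.1)
```

In this sketch, averaging rather than summing keeps the score on a comparable scale when the number of dimensions changes, and trimming limits the influence of a few extreme dimensions. Whether the thesis trims per dimension or in some other way is not stated in the abstract.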
Date
2026-05
Resource Type
Text
Resource Subtype
Thesis (Masters Degree)