Algorithms for Data Fusion, Representation Learning, and Scalable Clustering based on Constrained Low-Rank Approximation

Author(s)
Choi, Dongjin
Associated Organization(s)
Organizational Unit
School of Computational Science and Engineering
School established in May 2010
Abstract
In this dissertation, we address significant challenges in modern data analysis through novel Constrained Low-Rank Approximation (CLRA) algorithms. Contemporary datasets often exhibit complex characteristics: multi-type heterogeneity with interconnected entities, information from diverse sources, inherent sparsity, and massive scale. Traditional methods struggle with these complexities, motivating our development of advanced CLRA techniques that strategically incorporate constraints into matrix factorization. We introduce three primary contributions. First, we develop a Collective Symmetric Nonnegative Matrix Factorization framework for co-embedding multi-type data, constructing an integrated similarity matrix that captures both intra-type and inter-type relationships. We extend this to Multi-Granularity NMF, which discovers hierarchical structures across levels via alignment matrices. Second, we propose WellFactor, an integrative framework for robust patient profiling from heterogeneous healthcare data, which handles missing diagnoses via matrix masking under an open-world assumption and incorporates semi-supervision. Third, we develop the k-vertices clustering framework, a generalization of k-means that performs soft clustering. We further enhance it with minimum-volume regularization to mitigate vertex ambiguity and propose a computationally efficient hierarchical algorithm. Through extensive experiments on diverse datasets spanning academic literature, healthcare records, and benchmark collections, we demonstrate that our methods outperform relevant baselines in co-embedding quality, clustering accuracy, prediction performance, and computational efficiency. Our contributions advance CLRA methods for robust data fusion and representation learning, providing practical tools for extracting meaningful insights from complex, multi-type data across applications including interactive exploration, patient subtyping, and general clustering tasks.
Date
2025-05-16
Resource Type
Text
Resource Subtype
Dissertation