Scalable Data Mining via Constrained Low Rank Approximation

Eswar, Srinivas

dc.contributor.advisor	Vuduc, Richard
dc.contributor.advisor	Park, Haesun
dc.contributor.author	Eswar, Srinivas
dc.contributor.committeeMember	Catalyurek, Umit
dc.contributor.committeeMember	Chow, Edmond
dc.contributor.committeeMember	Ballard, Grey
dc.contributor.department	Computational Science and Engineering
dc.date.accessioned	2022-08-25T13:40:42Z
dc.date.available	2022-08-25T13:40:42Z
dc.date.created	2022-08
dc.date.issued	2022-08-01
dc.date.submitted	August 2022
dc.date.updated	2022-08-25T13:40:42Z
dc.description.abstract	Matrix and tensor approximation methods are recognised as foundational tools for modern data analytics. Their strength lies in a long history of rigorous and principled theoretical foundations, judicious formulations via various constraints, and the availability of fast computer programs. Multiple Constrained Low Rank Approximation (CLRA) formulations exist for commonly encountered tasks such as clustering, dimensionality reduction, and anomaly detection. The primary challenge in modern data analytics is the sheer volume of data to be analysed, often requiring multiple machines just to hold the dataset in memory. This dissertation presents CLRA as a key enabler of scalable data mining in distributed-memory parallel machines. Nonnegative Matrix Factorisation (NMF) is the primary CLRA method studied in this dissertation. NMF imposes nonnegativity constraints on the factor matrices and is a well-studied formulation known for its simplicity, interpretability, and clustering prowess. The major bottleneck in most NMF algorithms is a distributed matrix-multiplication kernel. We develop the Parallel Low rank Approximation with Nonnegativity Constraints (PLANC) software package, building on the earlier MPI-FAUN library, which includes an efficient matrix-multiplication kernel tailored to the CLRA case. It employs carefully designed parallel algorithms and data distributions to avoid unnecessary computation and communication. We extend PLANC to include several optimised Nonnegative Least-Squares (NLS) solvers and symmetric constraints, effectively employing the optimised matrix-multiplication kernel. We develop a parallel inexact Gauss-Newton algorithm for Symmetric Nonnegative Matrix Factorisation (SymNMF). In particular, PLANC is able to efficiently utilise second-order information when imposing symmetry constraints without incurring the prohibitive memory and computational costs associated with these methods.
We observe 70% efficiency while scaling up these methods. We develop new parallel algorithms for fusing and analysing data with multiple modalities in the Joint Nonnegative Matrix Factorisation (JointNMF) context. JointNMF is capable of knowledge discovery when both feature-data and data-data information is present in a data source. We extend PLANC to handle this case of simultaneously approximating two different large input matrices and study the various trade-offs encountered in the bottleneck matrix-multiplication kernel. We show that these ideas translate naturally to the multilinear setting when data is presented in the form of a tensor. A bottleneck computation analogous to matrix multiplication, the Matricised-Tensor Times Khatri-Rao Product (MTTKRP) kernel, is implemented. We conclude by describing some avenues for future research which extend the work and ideas in this dissertation. In particular, we consider the notion of structured sparsity, where the user has some control over the nonzero pattern, which appears in computations for various tasks like cross-validation, working with missing values, robust CLRA models, and in the semi-supervised setting.
dc.description.degree	Ph.D.
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/1853/67334
dc.language.iso	en_US
dc.publisher	Georgia Institute of Technology
dc.subject	Low rank approximation
dc.subject	Parallel algorithms
dc.subject	High performance computing
dc.subject	Data Mining
dc.title	Scalable Data Mining via Constrained Low Rank Approximation
dc.type	Text
dc.type.genre	Dissertation
dspace.entity.type	Publication
local.contributor.advisor	Park, Haesun
local.contributor.advisor	Vuduc, Richard
local.contributor.corporatename	College of Computing
local.contributor.corporatename	School of Computational Science and Engineering
relation.isAdvisorOfPublication	92013a6f-96b2-4ca8-9ef7-08f408ec8485
relation.isAdvisorOfPublication	e9a36794-e148-4304-8933-6ae0449c21d2
relation.isOrgUnitOfPublication	c8892b3c-8db6-4b7b-a33a-1b67f7db2021
relation.isOrgUnitOfPublication	01ab2ef1-c6da-49c9-be98-fbd1d840d2b1
thesis.degree.level	Doctoral