Title:
Scalable Data Mining via Constrained Low Rank Approximation

dc.contributor.advisor Vuduc, Richard
dc.contributor.advisor Park, Haesun
dc.contributor.author Eswar, Srinivas
dc.contributor.committeeMember Catalyurek, Umit
dc.contributor.committeeMember Chow, Edmond
dc.contributor.committeeMember Ballard, Grey
dc.contributor.department Computational Science and Engineering
dc.date.accessioned 2022-08-25T13:40:42Z
dc.date.available 2022-08-25T13:40:42Z
dc.date.created 2022-08
dc.date.issued 2022-08-01
dc.date.submitted August 2022
dc.date.updated 2022-08-25T13:40:42Z
dc.description.abstract Matrix and tensor approximation methods are recognised as foundational tools for modern data analytics. Their strength lies in their long history of rigorous and principled theoretical foundations, judicious formulations via various constraints, along with the availability of fast computer programs. Multiple Constrained Low Rank Approximation (CLRA) formulations exist for various commonly encountered tasks like clustering, dimensionality reduction, anomaly detection, amongst others. The primary challenge in modern data analytics is the sheer volume of data to be analysed, often requiring multiple machines to just hold the dataset in memory. This dissertation presents CLRA as a key enabler of scalable data mining in distributed-memory parallel machines. Nonnegative Matrix Factorisation (NMF) is the primary CLRA method studied in this dissertation. NMF imposes nonnegativity constraints on the factor matrices and is a well studied formulation known for its simplicity, interpretability, and clustering prowess. The major bottleneck in most NMF algorithms is a distributed matrix-multiplication kernel. We develop the Parallel Low rank Approximation with Nonnegativity Constraints (PLANC) software package, building on the earlier MPI-FAUN library, which includes an efficient matrix-multiplication kernel tailored to the CLRA case. It employs carefully designed parallel algorithms and data distributions to avoid unnecessary computation and communication. We extend PLANC to include several optimised Nonnegative Least-Squares (NLS) solvers and symmetric constraints, effectively employing the optimised matrix-multiplication kernel. We develop a parallel inexact Gauss-Newton algorithm for Symmetric Nonnegative Matrix Factorisation (SymNMF). In particular PLANC is able to efficiently utilise second-order information when imposing symmetry constraints without incurring the prohibitive memory and computational costs associated with these methods. We are able to observe 70% efficiency while scaling up these methods. We develop new parallel algorithms for fusing and analysing data with multiple modalities in the Joint Nonnegative Matrix Factorisation (JointNMF) context. JointNMF is capable of knowledge discovery when both feature-data and data-data information is present in a data source. We extend PLANC to handle this case of simultaneously approximating two different large input matrices and study the various trade-offs encountered in the bottleneck matrix-multiplication kernel. We show that these ideas translate naturally to the multilinear setting when data is presented in the form of a tensor. A bottleneck computation analogous to the matrix-multiply, the Matricised-Tensor Times Khatri-Rao Product (MTTKRP) kernel, is implemented. We conclude by describing some avenues for future research which extend the work and ideas in this dissertation. In particular, we consider the notion of structured sparsity, where the user has some control over the nonzero pattern, which appears in computations for various tasks like cross-validation, working with missing values, robust CLRA models, and in the semi-supervised setting.
dc.description.degree Ph.D.
dc.format.mimetype application/pdf
dc.identifier.uri http://hdl.handle.net/1853/67334
dc.language.iso en_US
dc.publisher Georgia Institute of Technology
dc.subject Low rank approximation
dc.subject Parallel algorithms
dc.subject High performance computing
dc.subject Data Mining
dc.title Scalable Data Mining via Constrained Low Rank Approximation
dc.type Text
dc.type.genre Dissertation
dspace.entity.type Publication
local.contributor.advisor Park, Haesun
local.contributor.advisor Vuduc, Richard
local.contributor.corporatename College of Computing
local.contributor.corporatename School of Computational Science and Engineering
relation.isAdvisorOfPublication 92013a6f-96b2-4ca8-9ef7-08f408ec8485
relation.isAdvisorOfPublication e9a36794-e148-4304-8933-6ae0449c21d2
relation.isOrgUnitOfPublication c8892b3c-8db6-4b7b-a33a-1b67f7db2021
relation.isOrgUnitOfPublication 01ab2ef1-c6da-49c9-be98-fbd1d840d2b1
thesis.degree.level Doctoral
Files
Original bundle
Now showing 1 - 1 of 1
Thumbnail Image
Name:
ESWAR-DISSERTATION-2022.pdf
Size:
2.58 MB
Format:
Adobe Portable Document Format
Description:
License bundle
Now showing 1 - 1 of 1
No Thumbnail Available
Name:
LICENSE.txt
Size:
3.87 KB
Format:
Plain Text
Description: