New Parallel Algorithms for Large-Scale Matrix Computations

Author(s)
Huang, Hua
Associated Organization(s)
School of Computational Science and Engineering
Abstract
Matrix computation is a critical part of scientific simulations and data analysis in fields such as density functional theory, recommendation systems, and neural networks. With the rapid growth of dataset sizes and problem scales, it is imperative to develop scalable parallel algorithms for efficient dense and sparse matrix operations. This dissertation presents scale-out algorithms that leverage multiple processors simultaneously while addressing data movement bottlenecks, and scale-up algorithms that improve single-node performance by exploiting matrix properties such as sparsity and low-rank structure. We first propose the communication-avoiding 3D matrix multiplication algorithm (CA3DMM), an approach to parallelizing general dense matrix multiplication that minimizes communication volume by optimizing matrix partitioning and data transfer patterns. CA3DMM attains the theoretical lower bound on communication cost with a simple formulation and shows superior parallel performance compared to existing methods. We then propose a hybrid approach for efficient distributed-memory polar decomposition (PD), which can in turn be used to compute eigenvalue decompositions and singular value decompositions. The proposed hybrid PD approach combines multiple iterative methods for computing the PD and utilizes CA3DMM together with a new parallel orthonormalization algorithm for better performance. For large sparse matrices, we propose the communication-reduced parallel sparse-dense matrix multiplication (CRP-SpMM) algorithm. CRP-SpMM benefits from existing sparse matrix partitioning methods developed for sparse matrix-dense vector multiplication (SpMV), and further reduces communication costs by choosing effective 2D process grid shapes that parallelize the computation across different columns of the dense input matrix. The parallel implementation of CRP-SpMM significantly outperforms existing distributed-memory parallel SpMM codes. Lastly, we present the design and parallel implementation of H2Pack, a high-performance, multi-purpose library for kernel matrices. Combining multiple state-of-the-art mathematical methods, H2Pack compresses kernel matrices in the H^2 matrix format, yielding O(N) storage and matrix-vector multiplication costs. As a result, H2Pack can easily and efficiently handle million-by-million kernel matrices on personal computers, while outperforming the widely used fast multipole method (FMM) on multiple tasks.
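
To make the grid-selection idea behind CA3DMM concrete, the following minimal Python sketch chooses a 3D process grid p = px * py * pz for C = A B under a deliberately simplified communication-volume model. The cost model and function names here are illustrative assumptions, not the dissertation's algorithm; they only show why balanced grids suit square products while skewed matrix shapes favor skewed grids.

    # Toy model: each process owns an (m/px) x (k/pz) block of A and a
    # (k/pz) x (n/py) block of B; A is replicated along py, B along px,
    # and partial C results are reduced along pz.

    def grid_candidates(p):
        """Enumerate all factorizations p = px * py * pz."""
        for px in range(1, p + 1):
            if p % px:
                continue
            q = p // px
            for py in range(1, q + 1):
                if q % py == 0:
                    yield px, py, q // py

    def comm_volume(m, n, k, px, py, pz):
        """Simplified per-process communication volume (words moved)."""
        a = (m * k) / (px * pz) * (py - 1)   # replicate A blocks
        b = (k * n) / (py * pz) * (px - 1)   # replicate B blocks
        c = (m * n) / (px * py) * (pz - 1)   # reduce partial C blocks
        return a + b + c

    def best_grid(m, n, k, p):
        return min(grid_candidates(p), key=lambda g: comm_volume(m, n, k, *g))

    if __name__ == "__main__":
        print(best_grid(8192, 8192, 8192, 64))   # square case -> (4, 4, 4)
        print(best_grid(1 << 20, 128, 128, 64))  # tall-skinny case -> 1D grid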
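
As background for the PD work, here is a small serial NumPy sketch of one classic iterative PD method, the Newton iteration X <- (X + X^{-T}) / 2; the dissertation's hybrid scheme combines several such methods in a distributed-memory setting, which this toy does not attempt. The sketch also shows the standard fact the abstract relies on: a PD plus a symmetric eigendecomposition yields an SVD.

    import numpy as np

    def polar_newton(A, tol=1e-12, max_iter=100):
        """Polar decomposition A = U H of a square nonsingular A via the
        classic Newton iteration (serial illustration only)."""
        X = A.copy()
        for _ in range(max_iter):
            Xn = 0.5 * (X + np.linalg.inv(X).T)
            done = np.linalg.norm(Xn - X, "fro") <= tol * np.linalg.norm(Xn, "fro")
            X = Xn
            if done:
                break
        U = X                # orthogonal polar factor
        H = U.T @ A          # symmetric positive semidefinite factor
        H = 0.5 * (H + H.T)  # symmetrize to clean up rounding
        return U, H

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        A = rng.standard_normal((200, 200))
        U, H = polar_newton(A)
        w, V = np.linalg.eigh(H)   # H = V diag(w) V^T, w >= 0 up to rounding
        # A = U H = (U V) diag(w) V^T is an SVD of A (up to ordering of w)
        print(np.linalg.norm(A - U @ H))
        print(np.linalg.norm(U.T @ U - np.eye(200)))
        print(np.linalg.norm(A - (U @ V) @ np.diag(w) @ V.T))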
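
The observation exploited by CRP-SpMM's 2D process grids can be seen in serial form: Y = A X splits into independent products over column blocks of X. The SciPy sketch below is an assumed illustration of that decomposition only; it contains none of the partitioning or communication machinery of the actual algorithm.

    import numpy as np
    import scipy.sparse as sp

    def spmm_column_blocked(A, X, n_blocks):
        """Y = A @ X computed block-by-block over columns of X; in a
        parallel setting each block could go to a different process group."""
        n = X.shape[1]
        bounds = np.linspace(0, n, n_blocks + 1, dtype=int)
        Y = np.empty((A.shape[0], n))
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            Y[:, lo:hi] = A @ X[:, lo:hi]   # each block is independent work
        return Y

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        A = sp.random(4000, 4000, density=1e-3, format="csr", random_state=1)
        X = rng.standard_normal((4000, 32))
        assert np.allclose(spmm_column_blocked(A, X, 4), A @ X)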
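
Finally, the compressibility that H2Pack exploits can be checked numerically: the kernel block between two well-separated point clusters is numerically low rank, which is what makes H^2-type O(N) storage possible. The NumPy experiment below (the 1/r kernel, cluster geometry, and tolerance are my assumptions, and this is not H2Pack itself) demonstrates the effect.

    import numpy as np

    rng = np.random.default_rng(2)
    src = rng.uniform(0.0, 1.0, size=(1000, 3))        # cluster near the origin
    trg = rng.uniform(0.0, 1.0, size=(1000, 3)) + 5.0  # well-separated cluster

    diff = trg[:, None, :] - src[None, :, :]
    K = 1.0 / np.linalg.norm(diff, axis=2)             # 1/r kernel block

    s = np.linalg.svd(K, compute_uv=False)
    rank = int(np.sum(s / s[0] > 1e-8))
    print(rank)   # numerical rank far smaller than 1000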
Date
2024-07-17
Resource Type
Text
Resource Subtype
Dissertation