Scalable and resilient sparse linear solvers

Sao, Piyush kumar

Title:

Scalable and resilient sparse linear solvers

Files

SAO-DISSERTATION-2018.pdf (5.2 MB)

Author(s)

Sao, Piyush kumar

Advisor(s)

Vuduc, Richard

Advisor(s)

Person

Vuduc, Richard

Associated Organization(s)

Organizational Unit

College of Computing

Organizational Unit

School of Computational Science and Engineering

Collections

Theses and Dissertations

Permanent Link

http://hdl.handle.net/1853/60233

Abstract

Solving a large and sparse system of linear equations is a ubiquitous problem in scientific computing. The challenges in scaling such solvers on current and future parallel computer systems are the high-cost of communication and the expected decrease in reliability of the hardware components. This dissertation contributes new techniques to address these issues. Regarding communication, we make two advances to reduce both on-node and inter-node communication of distributed memory sparse direct solvers. On-node, we propose a novel technique, called the HALO, targeted at heterogeneous architectures consisting of multicore and hardware accelerator such as GPU or Xeon-Phi. The name HALO is a shorthand for highly asynchronous lazy offload, which refers to the way the method combines highly aggressive use of asynchrony with the accelerated offload, lazy updates, and data shadowing (a la Halo or ghost zones), all of which serve to hide and reduce communication, whether to local memory, across the network, or over PCIe. The overall hybrid solver achieves speed-up of up-to 3x on a variety of realistic test problems in single and multi-node configurations. To reduce inter-node communication, we present a novel communication-avoiding 3D sparse LU factorization algorithm. The 3D sparse LU factorization algorithm uses a three-dimensional logical arrangement of MPI processes and combines the data redundancy with the so-called elimination tree parallelism to reduce the communication. The 3D algorithm reduces the asymptotic communication costs by a factor of $O(\sqrt(log n))$ and latency costs by a factor of $O(log n)$ for planar sparse matrices arising from finite element discretization of two-dimensional PDEs. For the non-planar sparse matrices, it reduces the communication and latency costs by a constant factor. Beyond performance, we consider methods to improve solver resilience. In emerging and future systems with billions of computing elements, hardware faults during the execution may become a norm rather than an exception. We illustrate the principle of self-stabilization for constructing fault-tolerant iterative linear solvers. We give two proof-of-concept examples of self-stabilizing iterative linear solvers: one for steepest descent (SD) and one for conjugate gradients (CG). Our self-stabilized versions of SD and CG require small amounts of fault-detection, e.g., we may check only for NaNs and infinities. We test our approach experimentally by analyzing its convergence and overhead for different types and rates of faults.

Date Issued

2018-05-22

Resource Type

Text

Resource Subtype

Dissertation

Full item page

Title:

Scalable and resilient sparse linear solvers

Files

Author(s)

Authors

Advisor(s)

Advisor(s)

Editor(s)

Associated Organization(s)

Series

Collections

Supplementary to

Permanent Link

Abstract

Sponsor

Date Issued

Extent

Resource Type

Resource Subtype

Rights Statement

Rights URI

Georgia Tech Library

Title: Scalable and resilient sparse linear solvers

Files

Author(s)

Authors

Advisor(s)

Advisor(s)

Editor(s)

Associated Organization(s)

Series

Collections

Supplementary to

Permanent Link

Abstract

Sponsor

Date Issued

Extent

Resource Type

Resource Subtype

Rights Statement

Rights URI

Title:

Scalable and resilient sparse linear solvers