Organizational Unit: School of Computational Science and Engineering


Publication Search Results

  • Item
    Diagnosing performance bottlenecks in HPC applications
    (Georgia Institute of Technology, 2019-03-29) Czechowski, Kenneth
The software performance optimization process is one of the most challenging aspects of developing highly performant code because the underlying performance limitations are hard to diagnose. In many cases, identifying performance bottlenecks, such as latency stalls, requires a combination of fidelity and usability that existing tools do not provide: traditional performance models and runtime analysis lack the granularity necessary to uncover low-level bottlenecks, while architectural simulations are too cumbersome and fragile to employ as a primary source of information. To address this need, we propose a performance analysis technique, called Pressure Point Analysis (PPA), which delivers the accessibility of analytical models with the precision of a simulator. The approach is based on an autotuning-inspired technique that dynamically perturbs binary code (e.g., inserting or deleting instructions to affect utilization of functional units, altering memory access addresses to change the cache hit rate, or swapping registers to alter instruction-level dependencies) and then analyzes the effects the various perturbations have on overall performance. When systematically applied, a battery of carefully designed perturbations, each targeting a specific microarchitectural feature, can glean valuable insight into pressure points in the code. PPA provides actionable information about hardware-software interactions that software developers can use to manually tweak their application code. In some circumstances the performance bottlenecks are unavoidable, in which case this analysis can be used to establish a rigorous performance bound for the application. In other cases, this information can identify the primary performance limitations and project the potential performance improvement if those bottlenecks were mitigated.
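The perturb-and-measure idea can be illustrated with a runnable toy model (not the thesis's binary-rewriting machinery): runtime is modeled as the maximum over per-resource "pressures", and each synthetic perturbation inflates one resource. A perturbation that shifts the runtime exposes a pressure point; one that does not indicates slack in that resource. The resource names and values below are hypothetical.

```python
# Toy model of PPA-style inference (hypothetical resource names and values;
# the real technique perturbs binary code and measures actual hardware).
PRESSURES = {"fp_units": 0.6, "mem_bandwidth": 1.0, "load_latency": 0.4}

def runtime(pressures):
    # Bottleneck model: the most saturated resource dominates execution time.
    return max(pressures.values())

def pressure_point_report(pressures, bump=0.1):
    base = runtime(pressures)
    report = {}
    for res in pressures:
        perturbed = dict(pressures, **{res: pressures[res] + bump})
        # delta > 0: perturbing this resource slows the code, so it is a
        # pressure point; delta == 0: the resource has slack.
        report[res] = runtime(perturbed) - base
    return report

print(pressure_point_report(PRESSURES))
# {'fp_units': 0.0, 'mem_bandwidth': 0.1, 'load_latency': 0.0}
```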
  • Item
    Scalable tensor decompositions in high performance computing environments
    (Georgia Institute of Technology, 2018-07-31) Li, Jiajia
    This dissertation presents novel algorithmic techniques and data structures to help build scalable tensor decompositions on a variety of high-performance computing (HPC) platforms, including multicore CPUs, graphics co-processors (GPUs), and Intel Xeon Phi processors. A tensor may be regarded as a multiway array, generalizing matrices to more than two dimensions. When used to represent multifactor data, tensor methods can help analysts discover latent structure; this capability has found numerous applications in data modeling and mining in such domains as healthcare analytics, social networks analytics, computer vision, signal processing, and neuroscience, to name a few. When attempting to implement tensor algorithms efficiently on HPC platforms, there are several obstacles: the curse of dimensionality, mode orientation, tensor transformation, irregularity, and arbitrary tensor dimensions (or orders). These challenges result in non-trivial computational and storage overheads. This dissertation considers these challenges in the specific context of the two of the most popular tensor decompositions, the CANDECOMP/PARAFAC (CP) and Tucker decompositions, which are, roughly speaking, the tensor analogues to low-rank approximations in standard linear algebra. Within that context, two of the critical computational bottlenecks are the operations known as Tensor-Times-Matrix (TTM) and Matricized Tensor Times Khatri-Rao Product (MTTKRP). We consider these operations in cases when the tensor is dense or sparse. Our contributions include: 1) applying memoization to overcome the curse of dimensionality challenge that exists in a sequence of tensor operations; 2) addressing the challenge of mode orientation through a novel tensor format HICOO and proposing a parallel scheduler to avoid the locks for write-conflict memory; 3) carrying out TTM and MTTKRP operations in-place, for dense and sparse cases, to avoid tensor-matrix conversions; 4) employing different optimization and parameter tuning techniques for CPU and GPU implementations to conquer the challenges of the irregularity and arbitrary tensor orders. To validate these ideas, we have implemented them in three prototype libraries, named AdaTM, InTensLi, and ParTI!, for arbitrary-order tensors. AdaTM is a model-driven framework to generate an adaptive tensor memoization algorithm with the optimal parameters for sparse CP decomposition. InTensLi produces fast single-node implementations of dense TTM of an arbitrary dimension. ParTI! is short for a Parallel Tensor Infrastructure which is written in C, OpenMP, MPI, and NVIDIA CUDA for sparse tensors and supports MATLAB interfaces for application-level users.
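For readers unfamiliar with MTTKRP, the following is a minimal NumPy sketch of the textbook mode-0 operation for a sparse third-order tensor in coordinate (COO) form, M(i,:) += X(i,j,k) * (B(j,:) * C(k,:)). It illustrates the kernel the abstract names, not the optimized HICOO/ParTI! implementations.

```python
# Reference (unoptimized) mode-0 MTTKRP for a sparse third-order COO tensor.
import numpy as np

def mttkrp_mode0(inds, vals, B, C, dim0):
    """inds: (nnz, 3) integer coordinates; vals: (nnz,) nonzero values;
    B, C: factor matrices for modes 1 and 2, each of shape (dim, R)."""
    R = B.shape[1]
    M = np.zeros((dim0, R))
    for (i, j, k), v in zip(inds, vals):
        M[i, :] += v * (B[j, :] * C[k, :])   # Hadamard product of factor rows
    return M

# Tiny example: a 2x2x2 tensor with two nonzeros, rank R = 2.
inds = np.array([[0, 1, 0], [1, 0, 1]])
vals = np.array([3.0, 2.0])
B = np.arange(4.0).reshape(2, 2)
C = np.ones((2, 2))
print(mttkrp_mode0(inds, vals, B, C, dim0=2))  # [[6. 9.] [0. 2.]]
```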
  • Item
    Scalable and resilient sparse linear solvers
(Georgia Institute of Technology, 2018-05-22) Sao, Piyush Kumar
Solving a large, sparse system of linear equations is a ubiquitous problem in scientific computing. The challenges in scaling such solvers on current and future parallel computer systems are the high cost of communication and the expected decrease in reliability of hardware components. This dissertation contributes new techniques to address these issues. Regarding communication, we make two advances that reduce both the on-node and inter-node communication of distributed-memory sparse direct solvers. On-node, we propose a novel technique, called HALO, targeted at heterogeneous architectures consisting of multicore processors and hardware accelerators such as GPUs or Xeon Phis. The name HALO is shorthand for highly asynchronous lazy offload, which refers to the way the method combines highly aggressive use of asynchrony with accelerated offload, lazy updates, and data shadowing (a la halo or ghost zones), all of which serve to hide and reduce communication, whether to local memory, across the network, or over PCIe. The overall hybrid solver achieves speedups of up to 3x on a variety of realistic test problems in single-node and multi-node configurations. To reduce inter-node communication, we present a novel communication-avoiding 3D sparse LU factorization algorithm. It uses a three-dimensional logical arrangement of MPI processes and combines data redundancy with so-called elimination-tree parallelism to reduce communication. The 3D algorithm reduces the asymptotic communication costs by a factor of $O(\sqrt{\log n})$ and the latency costs by a factor of $O(\log n)$ for planar sparse matrices arising from finite-element discretizations of two-dimensional PDEs. For non-planar sparse matrices, it reduces the communication and latency costs by a constant factor. Beyond performance, we consider methods to improve solver resilience. In emerging and future systems with billions of computing elements, hardware faults during execution may become the norm rather than the exception. We illustrate the principle of self-stabilization for constructing fault-tolerant iterative linear solvers, giving two proof-of-concept examples: one for steepest descent (SD) and one for conjugate gradients (CG). Our self-stabilized versions of SD and CG require only small amounts of fault detection; e.g., we may check only for NaNs and infinities. We test our approach experimentally by analyzing its convergence and overhead for different types and rates of faults.
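The self-stabilization idea for iterative solvers can be sketched in a few lines. Below is an illustrative steepest-descent loop for a symmetric positive definite system (the reset policy and parameters here are assumptions for illustration, not the dissertation's exact algorithm): because the residual is recomputed from (A, b, x) at every step, a transient fault that corrupts x is washed out by subsequent iterations, and the only explicit detection needed is a NaN/infinity check.

```python
# Sketch of a self-stabilizing steepest-descent solver for SPD Ax = b.
import numpy as np

def self_stabilizing_sd(A, b, x0, iters=200):
    x = x0.copy()
    for _ in range(iters):
        if not np.all(np.isfinite(x)):   # cheap detection: NaNs/infinities only
            x = np.zeros_like(b)         # reset to a known-safe state
        r = b - A @ x                    # invariant recomputed every iteration
        if np.linalg.norm(r) < 1e-12:
            break
        alpha = (r @ r) / (r @ (A @ r))  # exact line search for SPD A
        x = x + alpha * r                # a corrupted x is progressively corrected
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(self_stabilizing_sd(A, b, np.zeros(2)))  # ~ [0.0909, 0.6364]
```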
  • Item
    Voxel-based offsetting at high resolution with tunable speed and precision using hybrid dynamic trees
    (Georgia Institute of Technology, 2016-11-11) Hossain, Mohammad Moazzem
In recent years, digital manufacturing has experienced a wave of rapid prototyping through the innovation and ubiquity of 3D printing technology. While such advancement liberates the constraints of shape selection in physical objects, 3D printing has yet to match the precision, robustness, and vast applicability offered by the classical subtractive manufacturing process. To simplify toolpath planning in conventional multi-axis CNC machining, recent research has proposed adopting voxel-based geometric modeling. Inherently, a voxel representation is amenable to parallel acceleration on modern, ubiquitous GPU hardware. While there are many different approaches to representing voxels, this work is based on a novel voxel data structure called the hybrid dynamic tree (HDT), which combines a dense grid and a sparse octree in a way that makes it both more compact (i.e., storage efficient) and better suited to GPUs (i.e., computation effective) than state-of-the-art alternatives. This dissertation contributes in the following four aspects. First, we present a parallel method to construct the HDT representation on the GPU for a CAD input modeled as a triangle mesh. In addition, to optimize the memory footprint of the HDT, our research explores a theoretical storage analysis for different active-node branchings in the octree. Thus, we incorporate tunability into the HDT organization to study the complexity of the memory footprint. The developed theoretical storage analysis is validated with rigorous experimentation, which helps devise optimal parameter selections for a storage-compact HDT representation. Next, the thesis presents a mathematical-morphology-based offsetting algorithm using the HDT voxel representation. At our target resolution of 4096 x 4096 x 4096, our goal is to compute large-scale offsets in minutes, match or beat the number of bits of the representation compared to state-of-the-art alternatives, and experimentally characterize any trade-offs among speed, storage, and precision. While using the HDT as the underlying data structure leads naturally to a storage-efficient representation, the challenge in developing a high-performance implementation of the offset algorithm is choosing an optimal configuration of the HDT parameters. These parameters govern not only the memory footprint of the voxelized representation of the solid but also the efficiency of parallel code execution on the GPU's parallel computing units. The capability to fine-tune a data structure is crucial for understanding, and thereby optimizing, the developed computation-intensive algorithm that uses the HDT as the underlying voxel representation. Toward that end, this thesis explores different practical approaches to achieving high-performance voxel offsetting. First, we study the impact of different HDT configurations on voxel offsetting. Next, to devise fast voxel offsetting, we analyze the trade-offs between speed and accuracy through a controllable size of the morphological filter; in particular, we study the impact of decomposing a large offset distance into a series of offsets with smaller distances. To facilitate this trade-off analysis, we implement a GPU-accelerated error-measurement technique. Finally, to enable even faster voxel offsetting, we present the principles of offloading the offset computation in HDTs across a cluster of GPUs co-hosted on the same computing node. Our research studies the impact of different approaches for CUDA kernel execution controlled through either a single CPU thread or multiple independent CPU threads. In addition, we examine different load-distribution policies that account for the computational disparity among the deployed GPUs. With more and more GPUs integrated on a single computing node, this exploration of algorithmic speedup through a load-balanced implementation of voxel offsetting across multiple GPUs demonstrates the high scalability of the HDT's hybrid voxel representation.
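As a point of reference for the offsetting operation itself, here is a dense-grid NumPy/SciPy sketch of voxel offsetting as morphological dilation with a spherical structuring element, including the decomposition of one large offset into repeated smaller ones. The thesis performs this on the sparse, GPU-resident HDT; this version only illustrates the underlying morphology and its speed/accuracy trade-off, with grid sizes chosen arbitrarily for the example.

```python
# Dense-grid voxel offsetting as morphological dilation (illustration only;
# the thesis operates on the sparse HDT structure on GPUs).
import numpy as np
from scipy import ndimage

def ball(radius):
    """Boolean spherical structuring element of the given voxel radius."""
    r = int(radius)
    z, y, x = np.ogrid[-r:r + 1, -r:r + 1, -r:r + 1]
    return x * x + y * y + z * z <= radius * radius

def offset(grid, radius, steps=1):
    """Dilate `grid` by `radius`, optionally as `steps` smaller dilations.
    The decomposition trades accuracy for speed: iterated small discrete
    balls only approximate one large ball, the trade-off studied above."""
    out = grid
    for _ in range(steps):
        out = ndimage.binary_dilation(out, structure=ball(radius / steps))
    return out

solid = np.zeros((64, 64, 64), dtype=bool)
solid[28:36, 28:36, 28:36] = True   # a small cube as the input solid
# One radius-6 dilation vs. three radius-2 dilations: similar, not identical.
print(offset(solid, 6).sum(), offset(solid, 6, steps=3).sum())
```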
  • Item
    The fast multipole method at exascale
    (Georgia Institute of Technology, 2013-11-26) Chandramowlishwaran, Aparna
This thesis presents a top-to-bottom analysis of designing and implementing fast algorithms for current and future systems. We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N-body problems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme-scale systems. We specifically address two key challenges. The first challenge is how to engineer fast code for today's platforms. We present the first in-depth study of multicore optimizations and tuning for the FMM, along with a systematic approach for transforming a conventionally parallelized FMM into a highly tuned one. We introduce novel optimizations that significantly improve the within-node scalability of the FMM, thereby enabling high performance on multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter-node communication costs. Using these models, we present results for choosing the optimal algorithmic tuning parameter. This analysis also yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs, absent significant changes, could cause it to become communication-bound as early as the year 2015. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of high-level algorithm-architecture co-design. To demonstrate the scientific significance of the FMM, we present two applications: the direct simulation of blood, a multiscale, multiphysics problem, and large-scale biomolecular electrostatics. MoBo (Moving Boundaries) is the infrastructure for the direct numerical simulation of blood; it comprises two key algorithmic components, of which the FMM is one. We were able to simulate blood flow using Stokesian dynamics on 200,000 cores of Jaguar, a petaflop system, and achieve a sustained performance of 0.7 Petaflop/s. The second application, which we propose as future work in this thesis, is biomolecular electrostatics, where we solve for the electrical potential using the boundary-integral formulation discretized with boundary element methods (BEM). The computational kernel in solving the large linear system is a dense matrix-vector multiply, which we propose to compute using our scalable FMM. We propose to begin with the two-dielectric problem, where the electrostatic field is calculated using two continuum dielectric media, the solvent and the molecule. This is only a first step toward solving biologically challenging problems that have more than two dielectric media, ion-exclusion layers, and solvent-filled cavities. Finally, given the difficulty of producing high-performance, scalable code, productivity is a key concern. Recently, numerical algorithms have been redesigned to take advantage of the architectural features of emerging multicore processors. These new classes of algorithms express fine-grained asynchronous parallelism and hence reduce the cost of synchronization. We performed the first extensive performance study of a recently proposed parallel programming model called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially ordered by semantic scheduling constraints. The CnC model is well suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art multicore systems. Our implementations in CnC were able to match, and in some cases even exceed, competing vendor-tuned and domain-specific library codes. We combine these two distinct research efforts by expressing the FMM in CnC; this approach tries to marry performance with productivity, which will be critical on future systems. Looking forward, we would like to extend this work to distributed-memory machines, specifically implementing the FMM in the new distributed CnC (distCnC) to express fine-grained parallelism, which would require significant effort in alternative models.
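The far-field approximation at the heart of the FMM can be demonstrated in isolation (a generic illustration, not the thesis's implementation): interactions with a well-separated cluster of sources are replaced by a single low-order expansion, here just the monopole term, turning O(N*M) pairwise work into O(N + M). A full FMM organizes such expansions hierarchically on an octree to handle arbitrary source/target distributions.

```python
# Monopole far-field approximation vs. direct O(N*M) 1/r potential evaluation.
import numpy as np

rng = np.random.default_rng(0)
targets = rng.random((1000, 3))                               # cluster at origin
sources = rng.random((1000, 3)) + np.array([10.0, 0.0, 0.0])  # well separated
q = rng.random(1000)                                          # source charges

# Direct evaluation: all N*M pairwise 1/r interactions.
diff = targets[:, None, :] - sources[None, :, :]
phi_direct = (q / np.linalg.norm(diff, axis=2)).sum(axis=1)

# Monopole approximation: total charge placed at the charge-weighted centroid,
# which cancels the dipole term of the expansion. Cost is O(N + M).
center = (q[:, None] * sources).sum(axis=0) / q.sum()
phi_far = q.sum() / np.linalg.norm(targets - center, axis=1)

print(np.abs(phi_far / phi_direct - 1.0).max())  # small relative error
```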