Organizational Unit: School of Computational Science and Engineering

Publication Search Results

Now showing 1 - 10 of 11
  • Item
    Fast and compact neural network via Tensor-Train reparameterization
    (Georgia Institute of Technology, 2023-08-28) Yin, Chunxing
    The exponential growth of data and model size poses a number of challenges for deep learning training. Large neural network layers can be parameterized based on tensor decomposition to compress model size, but at the potential costs of degraded accuracy and more execution time to reconstruct the layer parameters from the tensorized representation. In this dissertation, we explore neural network compression through Tensor Train (TT) reparameterization. We aim to develop efficient algorithms to accelerate training of tensorized networks while minimizing the memory consumption, and to understand the necessary components for the Tensor Train format to succeed in model compression. We design efficient algorithms to accelerate the training of tensorized layers in Convolutional Neural Networks (CNNs), Deep Learning Recommendation Models (DLRMs), and Graph Neural Networks (GNNs). While the use of TT for compression in CNNs has been suggested in the past, the prior art has not demonstrated significant speedups for training or inference. The reason is that conventional implementations of TT-compressed convolutional layers pose several challenges: increases in computational work for reconstructing TT-compressed layers, increases in memory footprint due to weight reconstruction, and limitations to parallel scalability as the effective problem sizes shrink under compression. We address these issues through asymptotic reductions in computation, avoidance of data movement, and an alternative parallelization strategy that significantly improves scalability. In recommendation models, the performance of TT-compressed DLRM (TT-Rec) is further optimized with batched matrix multiplication and caching strategies for embedding vector lookup operations. In addition, we present mathematically and empirically the effect of the weight initialization distribution on DLRM accuracy and propose to initialize the tensor cores of TT-Rec following a sampled Gaussian distribution. In the next part of this dissertation, we study node embeddings in graph neural networks, where both the numerical features and topological graph information need to be preserved. We design training schemes that unify hierarchical tensor decomposition and graph topology to exploit graph homophily, and we develop novel parameter initialization algorithms that introduce the graph spectrum to improve model convergence and accuracy. Finally, we evaluate our technique on million-node graphs to demonstrate its efficiency and accuracy on real-world graphs, as well as on synthetic graphs to understand the correlation between graph homophily and weight sharing in TT. While the primary focus of this dissertation lies in exploring proof-of-concept algorithms, its outcomes can hold significant implications for systems. For example, by transforming the data-intensive embedding operator into a compute-intensive and memory-efficient tensorized embedding, we can potentially reconfigure the allocation of system resources within a heterogeneous data center with a combination of CPUs and GPUs. Moreover, our compression technique would enable storing large modules on a limited-memory accelerator with data parallelism, thereby providing opportunities for optimizing communication.
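To make the Tensor Train reparameterization concrete, below is a minimal NumPy sketch (not code from the dissertation): a hypothetical 256x256 weight matrix is stored as four TT cores with illustrative ranks, and the full matrix is reconstructed only to show the parameter savings; the efficient algorithms described above deliberately avoid materializing the full weights.

```python
import numpy as np

# Hypothetical example: a 256 x 256 weight matrix factored as (4*4*4*4) x (4*4*4*4).
# Each TT core has shape (r_{k-1}, m_k, n_k, r_k); the ranks are chosen arbitrarily.
row_modes, col_modes, ranks = [4, 4, 4, 4], [4, 4, 4, 4], [1, 2, 2, 2, 1]
rng = np.random.default_rng(0)
cores = [rng.normal(scale=0.1, size=(ranks[k], row_modes[k], col_modes[k], ranks[k + 1]))
         for k in range(4)]

def tt_reconstruct(cores, row_modes, col_modes):
    """Rebuild the full weight matrix from its TT cores. For illustration only:
    efficient TT layers never materialize the full matrix."""
    full = cores[0]                                   # shape (1, m1, n1, r1)
    for core in cores[1:]:
        # Contract the trailing rank index with the next core's leading rank index.
        full = np.tensordot(full, core, axes=([-1], [0]))
    full = full.squeeze(axis=0).squeeze(axis=-1)      # (m1, n1, m2, n2, ..., md, nd)
    d = len(cores)
    perm = list(range(0, 2 * d, 2)) + list(range(1, 2 * d, 2))
    full = full.transpose(perm)                       # (m1, ..., md, n1, ..., nd)
    return full.reshape(int(np.prod(row_modes)), int(np.prod(col_modes)))

W = tt_reconstruct(cores, row_modes, col_modes)
print(W.shape, sum(c.size for c in cores), W.size)    # (256, 256), 192 TT parameters vs 65536
```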
  • Item
    Multifidelity Memory System Simulation
    (Georgia Institute of Technology, 2023-08-25) Lavin, Patrick
    As computer systems grow larger and more complex, it takes more time to simulate them in detail. Researchers interested in simulating large systems must choose between simpler, less-accurate models or simulating smaller portions of their benchmarks, both of which can be highly manual, offline approaches that require time-consuming analysis by experts. Multifidelity simulation aims to lessen this burden by adapting the fidelity of a simulation to the complexity of the behavior being simulated. Multifidelity simulation refers to a simulation that can utilize multiple models for the same phenomenon at different levels of fidelity. We borrow the phrase from the simulation of physical systems, where scientists may have models with more or fewer terms, or may resolve their models on smaller or larger grid sizes, depending on the nature of the behavior at any point or time in the simulation. We have taken those ideas and applied them to computer architecture simulation. In this dissertation, we present our novel multifidelity computer architecture simulation algorithm and implement it in two separate models: one for the cache and one for the entire memory system. Our cache model is able to automatically train and choose between low-fidelity models online, adapting to the complexity of the modeled behavior. The second model, for the memory system, refines the ideas developed for the first. We use statistical techniques to choose the data used to create the low-fidelity models and implement this work as reusable components within a widely used simulator, SST. This model achieves up to 2x speedup with only 1-5% mean error in instructions per cycle.
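The multifidelity idea can be illustrated with a toy Python sketch; this is not the SST components described above, and the class names, thresholds, and direct-mapped cache are all illustrative assumptions. An expensive detailed model runs until its observed hit rate stabilizes, at which point a cheap statistical model trained online takes over.

```python
import random

class DetailedCache:
    """Stand-in for an expensive, cycle-accurate cache model (direct-mapped)."""
    def __init__(self, lines=1024, line_bytes=64):
        self.lines, self.line_bytes, self.tags = lines, line_bytes, {}
    def access(self, addr):
        block = addr // self.line_bytes
        idx = block % self.lines
        hit = self.tags.get(idx) == block
        self.tags[idx] = block
        return hit

class StatisticalCache:
    """Low-fidelity model: replays a hit rate learned online from the detailed model."""
    def __init__(self, hit_rate):
        self.hit_rate = hit_rate
    def access(self, addr):
        return random.random() < self.hit_rate

def simulate(trace, window=10_000, tol=0.01):
    """Run the detailed model until its hit rate stabilizes, then switch fidelity."""
    detailed, fast, hits, history = DetailedCache(), None, 0, []
    for i, addr in enumerate(trace, start=1):
        hits += (fast or detailed).access(addr)
        if fast is None:
            history.append(hits / i)
            # Hand off to the cheap model once the observed hit rate has settled.
            if (i % window == 0 and i >= 2 * window
                    and abs(history[-1] - history[-window]) < tol):
                fast = StatisticalCache(history[-1])
    return hits / len(trace)

trace = [random.randrange(1 << 20) for _ in range(100_000)]
print(simulate(trace))
```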
  • Item
    Scalable Data Mining via Constrained Low Rank Approximation
    (Georgia Institute of Technology, 2022-08-01) Eswar, Srinivas
    Matrix and tensor approximation methods are recognised as foundational tools for modern data analytics. Their strength lies in their long history of rigorous and principled theoretical foundations, judicious formulations via various constraints, and the availability of fast computer programs. Multiple Constrained Low Rank Approximation (CLRA) formulations exist for commonly encountered tasks such as clustering, dimensionality reduction, and anomaly detection, amongst others. The primary challenge in modern data analytics is the sheer volume of data to be analysed, often requiring multiple machines just to hold the dataset in memory. This dissertation presents CLRA as a key enabler of scalable data mining on distributed-memory parallel machines. Nonnegative Matrix Factorisation (NMF) is the primary CLRA method studied in this dissertation. NMF imposes nonnegativity constraints on the factor matrices and is a well-studied formulation known for its simplicity, interpretability, and clustering prowess. The major bottleneck in most NMF algorithms is a distributed matrix-multiplication kernel. We develop the Parallel Low rank Approximation with Nonnegativity Constraints (PLANC) software package, building on the earlier MPI-FAUN library, which includes an efficient matrix-multiplication kernel tailored to the CLRA case. It employs carefully designed parallel algorithms and data distributions to avoid unnecessary computation and communication. We extend PLANC to include several optimised Nonnegative Least-Squares (NLS) solvers and symmetric constraints, effectively employing the optimised matrix-multiplication kernel. We develop a parallel inexact Gauss-Newton algorithm for Symmetric Nonnegative Matrix Factorisation (SymNMF). In particular, PLANC is able to efficiently utilise second-order information when imposing symmetry constraints without incurring the prohibitive memory and computational costs associated with these methods. We observe 70% efficiency while scaling up these methods. We develop new parallel algorithms for fusing and analysing data with multiple modalities in the Joint Nonnegative Matrix Factorisation (JointNMF) context. JointNMF is capable of knowledge discovery when both feature-data and data-data information is present in a data source. We extend PLANC to handle this case of simultaneously approximating two different large input matrices and study the various trade-offs encountered in the bottleneck matrix-multiplication kernel. We show that these ideas translate naturally to the multilinear setting when data is presented in the form of a tensor. A bottleneck computation analogous to the matrix multiply, the Matricised-Tensor Times Khatri-Rao Product (MTTKRP) kernel, is implemented. We conclude by describing some avenues for future research which extend the work and ideas in this dissertation. In particular, we consider the notion of structured sparsity, where the user has some control over the nonzero pattern, which appears in computations for various tasks like cross-validation, working with missing values, robust CLRA models, and the semi-supervised setting.
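As a point of reference for the NMF formulation, here is a minimal serial NumPy sketch using multiplicative updates. It shows only the local computation and uses MU rather than the optimised NLS solvers mentioned above, purely for brevity; PLANC's contribution lies in distributing the bottleneck matrix multiplies across MPI ranks.

```python
import numpy as np

def nmf_mu(A, k, iters=200, eps=1e-9):
    """Basic multiplicative-update NMF: A (m x n, nonnegative) ~= W @ H,
    with W (m x k) and H (k x n) kept elementwise nonnegative."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    W, H = rng.random((m, k)), rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)     # update H with W fixed
        W *= (A @ H.T) / (W @ (H @ H.T) + eps)   # update W with H fixed
    return W, H

A = np.abs(np.random.default_rng(1).random((200, 150)))
W, H = nmf_mu(A, k=10)
print(np.linalg.norm(A - W @ H) / np.linalg.norm(A))   # relative residual
```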
  • Item
    Performance Primitives for Artificial Neural Networks
    (Georgia Institute of Technology, 2021-05-10) Dukhan, Marat
    Optimized software implementations of artificial neural networks leverage primitives from performance libraries, such as the BLAS. However, these primitives were prototyped decades ago and do not necessarily reflect the patterns of computation in neural networks. I propose modifications to common primitives provided by performance libraries to make them better building blocks for artificial neural networks, with a focus on inference, i.e., evaluation of a pre-trained artificial neural network. I suggest three classes of performance primitives for convolutional operators and two optimized building blocks for softmax operators. High-intensity convolutional operators with large kernel sizes and unit stride benefit from asymptotically fast convolution algorithms based on the Winograd transform and the Fast Fourier transform. I jointly consider the Fourier or Winograd transform and the matrix-matrix multiplication of blocks of transformed coefficients, and suggest the tuple-GEMM primitive, which balances the number of irregular memory writes in the transformation with sufficient register blocking and instruction-level parallelism in the matrix-matrix multiplication part. The tuple-GEMM primitive can be thought of as a batched GEMM with a fixed architecture-dependent batch size and can be efficiently implemented as a modification of the Goto matrix-matrix multiplication algorithm. I additionally analyze small 2D Fast Fourier transforms and suggest options that work best for modern wide-SIMD processors. Lower-intensity convolutional operators with small kernel sizes, non-unit strides, or dilation do not benefit from the fast convolution algorithms and require a different set of optimizations. To accelerate these cases, I suggest replacing the traditional GEMM primitive with a novel Indirect GEMM primitive. The Indirect GEMM primitive is a slight modification of GEMM and can leverage the extensive research on efficient GEMM implementations. I further introduce the Indirect Convolution algorithm, which builds on top of the Indirect GEMM primitive, eliminates the runtime overhead of patch-building memory transformations, and substantially reduces the memory complexity of convolutional operators compared to traditional GEMM-based algorithms. Pointwise, or 1x1, convolutional operators directly map to matrix-matrix multiplication and prompt yet another approach to optimization. I demonstrate that neural networks heavy on pointwise convolutions can greatly benefit from sparsification of the weights tensor and representation of the operation as a sparse-matrix-dense-matrix multiplication (SpMM), and I introduce neural network-optimized SpMM primitives. While SpMM primitives in Sparse BLAS libraries target problems with extremely high sparsity (commonly 99+% sparsity) and non-random sparsity patterns, the proposed SpMM primitive is demonstrated to work well with moderate sparsity in the 70-95% range and unpredictable sparsity patterns. The softmax operator is light on elementary floating-point operations but involves evaluation of the exponential function, which in many implementations becomes the bottleneck. I demonstrate that with a high-throughput vectorized exponential function, the softmax computation saturates the memory bandwidth and can be further improved only by reducing the number of memory access operations. I then constructively prove that it is possible to replace the traditional three-pass softmax algorithms with a novel two-pass algorithm for up to a 28% runtime reduction.
I implemented the proposed ideas in the open source NNPACK, QNNPACK, and XNNPACK libraries for acceleration of neural networks on CPUs, which at the time of release delivered state-of-the-art performance on mobile, server, and Web platforms.
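The two-pass softmax idea can be sketched in a few lines of Python (the libraries above implement a vectorized version of this): the first pass maintains a running maximum together with a sum of exponentials that is rescaled whenever a new maximum appears, so the separate max and sum passes of the traditional three-pass algorithm collapse into one.

```python
import numpy as np

def softmax_three_pass(x):
    m = x.max()                       # pass 1: maximum (for numerical stability)
    e = np.exp(x - m)                 # pass 2: exponentials and their sum
    return e / e.sum()                # pass 3: normalization

def softmax_two_pass(x):
    # Pass 1: running maximum and a sum that is rescaled whenever a new
    # maximum is found, so max and sum need only a single sweep over x.
    m, s = -np.inf, 0.0
    for v in x:
        if v > m:
            s = s * np.exp(m - v) + 1.0
            m = v
        else:
            s += np.exp(v - m)
    # Pass 2: produce the normalized outputs.
    return np.exp(x - m) / s

x = np.random.randn(1000)
print(np.allclose(softmax_two_pass(x), softmax_three_pass(x)))   # True
```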
  • Item
    Diagnosing performance bottlenecks in HPC applications
    (Georgia Institute of Technology, 2019-03-29) Czechowski, Kenneth
    The software performance optimization process is one of the most challenging aspects of developing highly performant code because underlying performance limitations are hard to diagnose. In many cases, identifying performance bottlenecks, such as latency stalls, requires a combination of fidelity and usability that existing tools do not provide: traditional performance models and runtime analysis lack the granularity necessary to uncover low-level bottlenecks, while architectural simulations are too cumbersome and fragile to employ as a primary source of information. To address this need, we propose a performance analysis technique, called Pressure Point Analysis (PPA), which delivers the accessibility of analytical models with the precision of a simulator. This approach is based on an autotuning-inspired technique that dynamically perturbs binary code (e.g., inserting or deleting instructions to affect utilization of functional units, altering memory access addresses to change the cache hit rate, or swapping registers to alter instruction-level dependencies) and then analyzes the effects the various perturbations have on overall performance. When systematically applied, a battery of carefully designed perturbations, each targeting a specific microarchitectural feature, can yield valuable insight into pressure points in the code. PPA provides actionable information about hardware-software interactions that the software developer can use to manually tweak the application code. In some circumstances the performance bottlenecks are unavoidable, in which case this analysis can be used to establish a rigorous performance bound for the application. In other cases, this information can identify the primary performance limitations and project the potential performance improvements if these bottlenecks are mitigated.
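The PPA workflow can be caricatured in Python under a strong simplifying assumption: each "perturbation" here is an alternative callable rather than a modification of the compiled binary. The sketch conveys only the measure-perturb-compare loop, not the actual tooling described above.

```python
import time

def measure(fn, repeats=5):
    """Best-of-N wall-clock runtime of a callable, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

def pressure_point_analysis(baseline, perturbations, repeats=5):
    """Run the baseline and a battery of perturbed variants, each meant to stress
    one resource, and report each variant's relative runtime change."""
    base = measure(baseline, repeats)
    return {name: (measure(variant, repeats) - base) / base
            for name, variant in perturbations.items()}

# Hypothetical usage: a large sensitivity for the strided variant would point to
# the memory system, rather than arithmetic throughput, as the pressure point.
data = list(range(1_000_000))
print(pressure_point_analysis(
    baseline=lambda: sum(data),
    perturbations={"strided_reads": lambda: sum(data[::16])},
    repeats=3))
```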
  • Item
    Automated surface finish inspection using convolutional neural networks
    (Georgia Institute of Technology, 2019-03-25) Louhichi, Wafa
    The surface finish of a machined part has an important effect on friction, wear, and aesthetics. Surface finish has been a critical quality measure since the 1980s, mainly due to demands from the automotive industry. Visual inspection and quality control have traditionally been done by human experts. Normally, it takes a substantial amount of the operator's time to stop the process and compare the quality of the produced piece against a surface roughness gauge. This manual process does not guarantee consistent surface quality and is subject to human error and the subjective opinion of the expert. Recent advances in image processing, computer vision, and machine learning have created a path towards automated surface finish inspection, increasing the automation level of the whole process even further. In this thesis work, we propose a deep learning approach to replicate human judgment without using a surface roughness gauge. We used a Convolutional Neural Network (CNN) to train a surface finish classifier. Because of data scarcity, we generated our own image dataset of aluminum pieces produced by turning and boring operations on a Computer Numerical Control (CNC) lathe, consisting of 980 training images, 160 validation images, and 140 test images. Considering the limited dataset and the computational cost of training deep neural networks from scratch, we applied transfer learning to models pre-trained on the publicly available ImageNet benchmark dataset. We used the PyTorch deep learning framework and both CPU and GPU to train a ResNet18 CNN. Training on the CPU took 1h21min55s with a test accuracy of 97.14%, while training on the GPU took 1min47s with a test accuracy of 97.86%. We used the Keras API, which runs on top of TensorFlow, to train a MobileNet model. Training on Colaboratory's GPU took 1h32m14s with an accuracy of 98.57%. The deep CNN models provided surprisingly high accuracy, misclassifying only a few of the 140 test images. The MobileNet model allows inference to run efficiently on mobile devices. This affordable and easy-to-use solution provides a viable new approach to automated surface inspection systems (ASIS).
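A minimal PyTorch sketch of the transfer-learning setup described above (modern torchvision API; the class count, learning rate, and freezing policy are illustrative assumptions, not the thesis configuration): an ImageNet-pretrained ResNet18 backbone is frozen and only a new classification head is trained.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical surface-finish classes, e.g. {"smooth", "acceptable", "rough"}.
num_classes = 3
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():                    # freeze the pretrained backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)   # new classifier head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_epoch(loader, device="cpu"):
    """One pass over a DataLoader yielding normalized 224x224 crops and labels."""
    model.train().to(device)
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images.to(device)), labels.to(device))
        loss.backward()
        optimizer.step()
```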
  • Item
    Scalable tensor decompositions in high performance computing environments
    (Georgia Institute of Technology, 2018-07-31) Li, Jiajia
    This dissertation presents novel algorithmic techniques and data structures to help build scalable tensor decompositions on a variety of high-performance computing (HPC) platforms, including multicore CPUs, graphics co-processors (GPUs), and Intel Xeon Phi processors. A tensor may be regarded as a multiway array, generalizing matrices to more than two dimensions. When used to represent multifactor data, tensor methods can help analysts discover latent structure; this capability has found numerous applications in data modeling and mining in such domains as healthcare analytics, social network analytics, computer vision, signal processing, and neuroscience, to name a few. When attempting to implement tensor algorithms efficiently on HPC platforms, there are several obstacles: the curse of dimensionality, mode orientation, tensor transformation, irregularity, and arbitrary tensor dimensions (or orders). These challenges result in non-trivial computational and storage overheads. This dissertation considers these challenges in the specific context of two of the most popular tensor decompositions, the CANDECOMP/PARAFAC (CP) and Tucker decompositions, which are, roughly speaking, the tensor analogues of low-rank approximations in standard linear algebra. Within that context, two of the critical computational bottlenecks are the operations known as Tensor-Times-Matrix (TTM) and Matricized Tensor Times Khatri-Rao Product (MTTKRP). We consider these operations in cases where the tensor is dense or sparse. Our contributions include: 1) applying memoization to overcome the curse-of-dimensionality challenge that arises in a sequence of tensor operations; 2) addressing the challenge of mode orientation through a novel tensor format, HICOO, and proposing a parallel scheduler to avoid locks on write-conflicted memory; 3) carrying out TTM and MTTKRP operations in place, for dense and sparse cases, to avoid tensor-matrix conversions; 4) employing different optimization and parameter-tuning techniques for CPU and GPU implementations to overcome the challenges of irregularity and arbitrary tensor orders. To validate these ideas, we have implemented them in three prototype libraries, named AdaTM, InTensLi, and ParTI!, for arbitrary-order tensors. AdaTM is a model-driven framework to generate an adaptive tensor memoization algorithm with the optimal parameters for sparse CP decomposition. InTensLi produces fast single-node implementations of dense TTM of arbitrary dimension. ParTI! is short for Parallel Tensor Infrastructure; it is written in C with OpenMP, MPI, and NVIDIA CUDA for sparse tensors and supports MATLAB interfaces for application-level users.
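For readers unfamiliar with the MTTKRP kernel mentioned above, a dense mode-0 version for a third-order tensor can be written in a few lines of NumPy and checked against an explicit Khatri-Rao reference; the libraries above target the much harder sparse, parallel, arbitrary-order case.

```python
import numpy as np

def mttkrp_mode0(X, B, C):
    """Dense MTTKRP for mode 0 of a third-order tensor:
    M[i, r] = sum_{j, k} X[i, j, k] * B[j, r] * C[k, r]."""
    return np.einsum('ijk,jr,kr->ir', X, B, C)

def khatri_rao(B, C):
    """Column-wise Kronecker product; row (j, k) maps to index j*K + k, matching
    the row-major unfolding of X used in the reference computation below."""
    J, R = B.shape
    K, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

I, J, K, R = 30, 40, 50, 8
rng = np.random.default_rng(0)
X, B, C = rng.random((I, J, K)), rng.random((J, R)), rng.random((K, R))
reference = X.reshape(I, J * K) @ khatri_rao(B, C)    # unfolding-based reference
print(np.allclose(mttkrp_mode0(X, B, C), reference))  # True
```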
  • Item
    Scalable and resilient sparse linear solvers
    (Georgia Institute of Technology, 2018-05-22) Sao, Piyush kumar
    Solving a large and sparse system of linear equations is a ubiquitous problem in scientific computing. The challenges in scaling such solvers on current and future parallel computer systems are the high cost of communication and the expected decrease in the reliability of hardware components. This dissertation contributes new techniques to address these issues. Regarding communication, we make two advances to reduce both on-node and inter-node communication of distributed-memory sparse direct solvers. On-node, we propose a novel technique, called HALO, targeted at heterogeneous architectures consisting of multicore CPUs and hardware accelerators such as GPUs or Xeon Phi. The name HALO is shorthand for highly asynchronous lazy offload, which refers to the way the method combines highly aggressive use of asynchrony with accelerated offload, lazy updates, and data shadowing (a la halo or ghost zones), all of which serve to hide and reduce communication, whether to local memory, across the network, or over PCIe. The overall hybrid solver achieves speed-ups of up to 3x on a variety of realistic test problems in single- and multi-node configurations. To reduce inter-node communication, we present a novel communication-avoiding 3D sparse LU factorization algorithm. The 3D sparse LU factorization algorithm uses a three-dimensional logical arrangement of MPI processes and combines data redundancy with so-called elimination tree parallelism to reduce communication. The 3D algorithm reduces the asymptotic communication costs by a factor of $O(\sqrt{\log n})$ and latency costs by a factor of $O(\log n)$ for planar sparse matrices arising from finite element discretization of two-dimensional PDEs. For non-planar sparse matrices, it reduces the communication and latency costs by a constant factor. Beyond performance, we consider methods to improve solver resilience. In emerging and future systems with billions of computing elements, hardware faults during execution may become the norm rather than the exception. We illustrate the principle of self-stabilization for constructing fault-tolerant iterative linear solvers. We give two proof-of-concept examples of self-stabilizing iterative linear solvers: one for steepest descent (SD) and one for conjugate gradients (CG). Our self-stabilized versions of SD and CG require only small amounts of fault detection; e.g., we may check only for NaNs and infinities. We test our approach experimentally by analyzing its convergence and overhead for different types and rates of faults.
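A minimal NumPy sketch of the self-stabilization idea for CG, not the dissertation's implementation: a periodic correction step recomputes the residual and search direction from the current iterate, and the only fault detection is a NaN/Inf check on the iterate, so a transient fault is eventually flushed out rather than precisely detected and rolled back.

```python
import numpy as np

def self_stabilizing_cg(A, b, tol=1e-8, max_iter=1000, correct_every=50):
    """Conjugate gradients with a periodic correction step in the spirit of
    self-stabilization. Illustrative sketch for a dense SPD matrix."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for k in range(max_iter):
        fault = not np.all(np.isfinite(x))          # minimal fault detection
        if fault:
            x = np.zeros_like(b)                    # fall back to a safe state
        if fault or k % correct_every == 0:         # self-stabilizing correction
            r = b - A @ x
            p = r.copy()
            rs = r @ r
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            return x
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Small SPD test problem.
n = 200
rng = np.random.default_rng(0)
Q = rng.random((n, n))
A = Q @ Q.T + n * np.eye(n)
b = rng.random(n)
x = self_stabilizing_cg(A, b)
print(np.linalg.norm(A @ x - b))
```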
  • Item
    Voxel-based offsetting at high resolution with tunable speed and precision using hybrid dynamic trees
    (Georgia Institute of Technology, 2016-11-11) Hossain, Mohammad Moazzem
    In recent years, digital manufacturing has experienced a wave of rapid prototyping driven by innovation in, and the ubiquity of, 3D printing technology. While this advancement relaxes the constraints on achievable shapes for physical objects, 3D printing has yet to match the precision, robustness, and broad applicability offered by the classical subtractive manufacturing process. To simplify toolpath planning in conventional multi-axis CNC machining, recent research has proposed adopting voxel-based geometric modeling. Inherently, a voxel representation is amenable to parallel acceleration on modern, ubiquitous GPU hardware. While there are many different approaches to representing voxels, this work is based on a novel voxel data structure called the hybrid dynamic tree (HDT), which combines a dense grid and a sparse octree in a way that makes it both more compact (i.e., storage-efficient) and better suited to GPUs (i.e., computationally effective) than state-of-the-art alternatives. This dissertation makes four main contributions. First, we present a parallel method to construct the HDT representation on the GPU for a CAD input modeled as a triangle mesh. In addition, to optimize the memory footprint of the HDT, our research develops a theoretical storage analysis for different active-node branchings in the octree. We incorporate tunability into the HDT organization to study the memory footprint, and the theoretical storage analysis is validated with rigorous experimentation that helps devise optimal parameter selections for a storage-compact HDT representation. Next, the thesis presents a mathematical-morphology-based offsetting algorithm using the HDT voxel representation. At our target resolution of 4096 x 4096 x 4096, our goal is to compute large-scale offsets in minutes, match or beat the number of bits of the representation compared to state-of-the-art alternatives, and experimentally characterize the trade-offs among speed, storage, and precision. While using the HDT as the underlying data structure leads naturally to a storage-efficient representation, the challenge in developing a high-performance implementation of the offsetting algorithm is choosing an optimal configuration of the HDT parameters. These parameters not only govern the memory footprint of the voxelized representation of the solid, but also control the efficiency of parallel code execution on the GPU. The ability to fine-tune the data structure is crucial for understanding, and thereby optimizing, the computation-intensive algorithm that uses the HDT as its underlying voxel representation. Towards that end, this thesis explores different practical approaches to achieve high-performance voxel offsetting. First, we study the impact of different HDT configurations on voxel offsetting. Next, to devise fast voxel offsetting, we analyze the trade-offs between speed and accuracy through a controllable size of the morphological filter, and we study the impact of decomposing a large offset distance into a series of offsets with smaller distances. To facilitate this trade-off analysis, we implement a GPU-accelerated error measurement technique. Finally, to enable even faster voxel offsetting, we present the principles of offloading the offset computation in the HDT across a cluster of GPUs co-hosted on the same computing node.
Our research studies the impact of different approaches to CUDA kernel execution, controlled through either a single CPU thread or multiple independent CPU threads. In addition, we examine different load distribution policies that account for the computational disparity among the deployed GPUs. With more and more GPUs integrated into a single computing node, this exploration of algorithmic speedup through a load-balanced implementation of voxel offsetting across multiple GPUs underscores the high scalability of the HDT's hybrid voxel representation.
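A small dense SciPy sketch of morphological voxel offsetting illustrates the speed/precision trade-off of decomposing one large offset into repeated smaller ones; the dissertation performs the analogous operations on the sparse HDT structure on GPUs at far higher resolution, so everything below (grid size, structuring element, step size) is an illustrative assumption.

```python
import numpy as np
from scipy import ndimage

def ball(radius):
    """Spherical structuring element with the given voxel radius."""
    r = int(radius)
    zz, yy, xx = np.mgrid[-r:r + 1, -r:r + 1, -r:r + 1]
    return xx**2 + yy**2 + zz**2 <= radius**2

def offset(solid, distance, step=None):
    """Offset (dilate) a boolean voxel solid by `distance` voxels, optionally
    decomposed into repeated smaller dilations of radius `step`. Repeated small
    steps are cheaper but only approximate a single large spherical filter."""
    if step is None:
        return ndimage.binary_dilation(solid, structure=ball(distance))
    out, remaining = solid, distance
    while remaining > 0:
        r = min(step, remaining)
        out = ndimage.binary_dilation(out, structure=ball(r))
        remaining -= r
    return out

# Tiny dense example (the dissertation's HDT handles 4096^3 sparsely on GPUs).
grid = np.zeros((64, 64, 64), dtype=bool)
grid[28:36, 28:36, 28:36] = True                  # an 8^3 cube
exact = offset(grid, 6)
approx = offset(grid, 6, step=2)
print(exact.sum(), approx.sum(), (exact ^ approx).sum())   # voxel-count error
```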
  • Item
    Implementation and analysis of a parallel vertex-centered finite element segmental refinement multigrid solver
    (Georgia Institute of Technology, 2016-04-28) Henneking, Stefan
    In a parallel vertex-centered finite element multigrid solver, segmental refinement can be used to avoid all inter-process communication on the fine grids. While domain decomposition methods generally require coupled subdomain processing for the numerical solution of a nonlinear elliptic boundary value problem, segmental refinement exploits the fact that subdomains are almost decoupled with respect to high-frequency error components. This makes it possible to perform multigrid with fully decoupled subdomains on the fine grids, an approach proposed as a sequential low-storage algorithm by Brandt in the 1970s and as a parallel algorithm by Brandt and Diskin in 1994. Adams published the first numerical results from a multilevel segmental refinement solver in 2014, confirming the asymptotic exactness of the scheme for a cell-centered finite volume implementation. We continue Brandt’s and Adams’ research by experimentally investigating the scheme’s accuracy with a vertex-centered finite element segmental refinement solver. We confirm that full multigrid accuracy can be preserved for a few segmental refinement levels, although we observe a different dependency on the segmental refinement parameter space. We show that various strategies for the grid transfers between the finest conventional multigrid level and the segmental refinement subdomains affect the solver accuracy. Scaling results are reported for a Cray XC30 with up to 4096 cores.
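For context, the sketch below is a textbook 1-D geometric multigrid V-cycle in NumPy, showing the standard ingredients (smoothing, restriction, prolongation, coarse-grid correction) that segmental refinement reorganizes to avoid fine-grid communication; the solver studied in this thesis is a parallel 3-D vertex-centered finite element code, so this is purely illustrative.

```python
import numpy as np

def v_cycle(u, f, h, nu=3):
    """One V-cycle for the 1-D Poisson problem -u'' = f with homogeneous
    Dirichlet BCs: weighted-Jacobi smoothing, full-weighting restriction,
    and linear-interpolation prolongation."""
    n = u.size - 1                       # number of intervals (a power of two)
    def smooth(u, f, h, sweeps):
        for _ in range(sweeps):          # weighted Jacobi, omega = 2/3
            u[1:-1] += (2.0 / 3.0) * 0.5 * (u[:-2] + u[2:] + h * h * f[1:-1] - 2.0 * u[1:-1])
        return u
    u = smooth(u, f, h, nu)              # pre-smoothing
    if n <= 2:
        return u
    r = np.zeros_like(u)                 # residual r = f - A u
    r[1:-1] = f[1:-1] - (2.0 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    rc = np.zeros(n // 2 + 1)            # full-weighting restriction
    rc[1:-1] = 0.25 * (r[1:-3:2] + 2.0 * r[2:-2:2] + r[3:-1:2])
    ec = v_cycle(np.zeros_like(rc), rc, 2.0 * h, nu)   # coarse-grid correction
    e = np.zeros_like(u)                 # linear-interpolation prolongation
    e[::2] = ec
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])
    return smooth(u + e, f, h, nu)       # post-smoothing

n = 128
h = 1.0 / n
x = np.linspace(0.0, 1.0, n + 1)
f = np.pi**2 * np.sin(np.pi * x)         # exact solution is sin(pi x)
u = np.zeros(n + 1)
for _ in range(10):
    u = v_cycle(u, f, h)
print(np.max(np.abs(u - np.sin(np.pi * x))))   # discretization-level error
```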