Organizational Unit: School of Computational Science and Engineering
Publication Search Results (showing 1–9 of 9)

Item: Scalable tensor decompositions in high performance computing environments (Georgia Institute of Technology, 2018-07-31)
Li, Jiajia ; Vuduc, Richard ; Sun, Jimeng ; Çatalyürek, Ümit V. ; Kolda, Tamara G. ; Ucar, Bora ; Bader, David A. ; Computational Science and Engineering

This dissertation presents novel algorithmic techniques and data structures to help build scalable tensor decompositions on a variety of high-performance computing (HPC) platforms, including multicore CPUs, graphics coprocessors (GPUs), and Intel Xeon Phi processors. A tensor may be regarded as a multi-way array, generalizing matrices to more than two dimensions. When used to represent multifactor data, tensor methods can help analysts discover latent structure; this capability has found numerous applications in data modeling and mining in such domains as healthcare analytics, social network analytics, computer vision, signal processing, and neuroscience, to name a few. When attempting to implement tensor algorithms efficiently on HPC platforms, there are several obstacles: the curse of dimensionality, mode orientation, tensor transformation, irregularity, and arbitrary tensor dimensions (or orders). These challenges result in nontrivial computational and storage overheads.

This dissertation considers these challenges in the specific context of two of the most popular tensor decompositions, the CANDECOMP/PARAFAC (CP) and Tucker decompositions, which are, roughly speaking, the tensor analogues of low-rank approximations in standard linear algebra. Within that context, two of the critical computational bottlenecks are the operations known as Tensor-Times-Matrix (TTM) and Matricized Tensor Times Khatri-Rao Product (MTTKRP). We consider these operations in cases when the tensor is dense or sparse. Our contributions include: 1) applying memoization to overcome the curse-of-dimensionality challenge that arises in a sequence of tensor operations; 2) addressing the challenge of mode orientation through a novel tensor format, HiCOO, and proposing a parallel scheduler that avoids locks on write-conflicted memory; 3) carrying out TTM and MTTKRP operations in place, for dense and sparse cases, to avoid tensor-matrix conversions; 4) employing different optimization and parameter-tuning techniques for CPU and GPU implementations to conquer the challenges of irregularity and arbitrary tensor orders. To validate these ideas, we have implemented them in three prototype libraries for arbitrary-order tensors, named AdaTM, InTensLi, and ParTI!. AdaTM is a model-driven framework that generates an adaptive tensor memoization algorithm with optimal parameters for sparse CP decomposition. InTensLi produces fast single-node implementations of dense TTM of arbitrary dimension. ParTI!, short for Parallel Tensor Infrastructure, is written in C, OpenMP, MPI, and NVIDIA CUDA for sparse tensors and provides MATLAB interfaces for application-level users.
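As a concrete illustration of the MTTKRP bottleneck named above, here is a minimal dense, third-order sketch in NumPy. This is my own illustration, not code from the libraries above, which target sparse tensors, arbitrary orders, and parallel hardware:

```python
import numpy as np

def mttkrp_mode0(X, B, C):
    """Mode-0 MTTKRP for a third-order dense tensor:
    M[i, r] = sum_{j,k} X[i, j, k] * B[j, r] * C[k, r]."""
    return np.einsum('ijk,jr,kr->ir', X, B, C)

# Tiny correctness check against a naive triple loop.
I, J, K, R = 4, 5, 6, 3
X = np.random.rand(I, J, K)
B, C = np.random.rand(J, R), np.random.rand(K, R)
M_ref = np.zeros((I, R))
for i in range(I):
    for j in range(J):
        for k in range(K):
            M_ref[i] += X[i, j, k] * B[j] * C[k]
assert np.allclose(mttkrp_mode0(X, B, C), M_ref)
```

In CP decomposition by alternating least squares, an MTTKRP like this dominates every iteration, which is why data layouts such as HiCOO and memoization across modes pay off.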

Item: The fast multipole method at exascale (Georgia Institute of Technology, 2013-11-26)
Chandramowlishwaran, Aparna ; Vuduc, Richard ; Bader, David ; Biros, George ; Barba, Lorena ; Knobe, Kathleen ; Computational Science and Engineering

This thesis presents a top-to-bottom analysis of designing and implementing fast algorithms for current and future systems. We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N-body problems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme-scale systems.

We specifically address two key challenges. The first challenge is how to engineer fast code for today's platforms. We present the first in-depth study of multicore optimizations and tuning for the FMM, along with a systematic approach for transforming a conventionally parallelized FMM into a highly tuned one. We introduce novel optimizations that significantly improve the within-node scalability of the FMM, thereby enabling high performance on multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter-node communication costs. Using these models, we present results for choosing the optimal algorithmic tuning parameter. This analysis also yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs, absent significant changes, could cause it to become communication-bound as early as the year 2015. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of high-level algorithm-architecture co-design.

To demonstrate the scientific significance of the FMM, we present two applications: the direct simulation of blood, a multiscale, multiphysics problem, and large-scale biomolecular electrostatics. MoBo (Moving Boundaries) is the infrastructure for the direct numerical simulation of blood; it comprises two key algorithmic components, of which the FMM is one. We were able to simulate blood flow using Stokesian dynamics on 200,000 cores of Jaguar, a petaflop system, and achieve a sustained performance of 0.7 Petaflop/s. The second application, proposed as future work in this thesis, is biomolecular electrostatics, where we solve for the electrical potential using the boundary-integral formulation discretized with boundary element methods (BEM). The computational kernel in solving the large linear system is a dense matrix-vector multiply, which we propose can be computed using our scalable FMM. We propose to begin with the two-dielectric problem, where the electrostatic field is calculated using two continuum dielectric media, the solvent and the molecule. This is only a first step toward solving biologically challenging problems that have more than two dielectric media, ion-exclusion layers, and solvent-filled cavities.

Finally, given the difficulty of producing high-performance scalable code, productivity is a key concern. Recently, numerical algorithms have been redesigned to take advantage of the architectural features of emerging multicore processors. These new classes of algorithms express fine-grained asynchronous parallelism and hence reduce the cost of synchronization. We performed the first extensive performance study of a recently proposed parallel programming model called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially ordered by semantic scheduling constraints. The CnC model is well suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art multicore systems. Our implementations in CnC were able to match, and in some cases even exceed, competing vendor-tuned and domain-specific library codes. We combine these two distinct research efforts by expressing the FMM in CnC; this approach tries to marry performance with productivity, which will be critical on future systems. Looking forward, we would like to extend this work to distributed-memory machines, specifically implementing the FMM in the new distributed CnC (distCnC) to express fine-grained parallelism that would require significant effort in alternative models.
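For context, here is a minimal sketch (my own illustration, not code from the thesis) of the O(N^2) direct summation that the FMM approximates in near-linear time by replacing far-field interactions with multipole expansions:

```python
import numpy as np

def direct_potential(pos, q):
    """O(N^2) pairwise Coulomb-like potential sum: the baseline the FMM approximates."""
    n = len(q)
    phi = np.zeros(n)
    for i in range(n):
        r = np.linalg.norm(pos - pos[i], axis=1)  # distances to all particles
        r[i] = np.inf                             # exclude self-interaction
        phi[i] = np.sum(q / r)
    return phi

pos = np.random.rand(1000, 3)   # particle positions
q = np.random.rand(1000)        # charges/masses
phi = direct_potential(pos, q)
```

The quadratic cost of this loop is exactly what makes the FMM's guaranteed-accuracy O(N) evaluation so attractive at extreme scale.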

Item: Performance Primitives for Artificial Neural Networks (Georgia Institute of Technology, 2021-05-10)
Dukhan, Marat ; Vuduc, Richard ; Chow, Edmond T ; Essa, Irfan A ; van de Geijn, Robert ; Vasilache, Nicolas ; Hammond, Jeff ; Computational Science and Engineering

Optimized software implementations of artificial neural networks leverage primitives from performance libraries, such as the BLAS. However, these primitives were prototyped decades ago and do not necessarily reflect the patterns of computation in neural networks. I propose modifications to common primitives provided by performance libraries to make them better building blocks for artificial neural networks, with a focus on inference, i.e., evaluation of a pre-trained artificial neural network. I suggest three classes of performance primitives for convolutional operators and two optimized building blocks for softmax operators.

High-intensity convolutional operators with large kernel sizes and unit stride benefit from asymptotically fast convolution algorithms based on the Winograd transform and the fast Fourier transform. I jointly consider the Fourier or Winograd transform and the matrix-matrix multiplication of blocks of transformed coefficients, and suggest a tuple-GEMM primitive that balances the number of irregular memory writes in the transformation with sufficient register blocking and instruction-level parallelism in the matrix-matrix multiplication part. The tuple-GEMM primitive can be thought of as a batched GEMM with a fixed, architecture-dependent batch size, and can be efficiently implemented as a modification of the Goto matrix-matrix multiplication algorithm. I additionally analyze small 2D fast Fourier transforms and suggest options that work best for modern wide-SIMD processors.

Lower-intensity convolutional operators with small kernel sizes, non-unit strides, or dilation do not benefit from the fast convolution algorithms and require a different set of optimizations. To accelerate these cases, I suggest replacing the traditional GEMM primitive with a novel Indirect GEMM primitive. The Indirect GEMM primitive is a slight modification of GEMM and can leverage the extensive research on efficient GEMM implementations. I further introduce the Indirect Convolution algorithm, which builds on top of the Indirect GEMM primitive, eliminates the runtime overhead of patch-building memory transformations, and substantially reduces the memory complexity of convolutional operators compared to traditional GEMM-based algorithms.

Pointwise, or 1x1, convolutional operators directly map to matrix-matrix multiplication and prompt yet another approach to optimization. I demonstrate that neural networks heavy on pointwise convolutions can greatly benefit from sparsification of the weights tensor, representing the operation as a sparse-matrix-dense-matrix multiplication (SpMM), and I introduce neural-network-optimized SpMM primitives. While SpMM primitives in Sparse BLAS libraries target problems with extremely high sparsity (commonly 99+%) and non-random sparsity patterns, the proposed SpMM primitive is demonstrated to work well with moderate sparsity in the 70-95% range and unpredictable sparsity patterns.

The softmax operator is light on elementary floating-point operations but involves evaluating the exponential function, which in many implementations becomes the bottleneck. I demonstrate that with a high-throughput vector exponential function, the softmax computation saturates memory bandwidth and can be further improved only by reducing the number of memory access operations. I then constructively prove that it is possible to replace the traditional three-pass softmax algorithms with a novel two-pass algorithm, for up to a 28% runtime reduction. I implemented the proposed ideas in the open-source NNPACK, QNNPACK, and XNNPACK libraries for acceleration of neural networks on CPUs, which at the time of release delivered state-of-the-art performance on mobile, server, and Web platforms.
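To illustrate the pass-count reduction described above, here is a minimal scalar sketch in Python (my own illustration; the thesis targets vectorized CPU implementations). The classic algorithm sweeps the input three times, while the two-pass variant fuses the running maximum and a rescaled running sum into a single sweep:

```python
import numpy as np

def softmax_three_pass(x):
    m = np.max(x)               # pass 1: maximum, for numerical stability
    e = np.exp(x - m)           # pass 2: exponentials and their sum
    return e / e.sum()          # pass 3: normalization

def softmax_two_pass(x):
    m, s = -np.inf, 0.0
    for v in x:                 # pass 1: fused running max and rescaled running sum
        m_new = max(m, v)
        s = s * np.exp(m - m_new) + np.exp(v - m_new)
        m = m_new
    return np.exp(x - m) / s    # pass 2: normalization

x = np.random.randn(1000)
assert np.allclose(softmax_three_pass(x), softmax_two_pass(x))
```

Since softmax on a modern CPU is bandwidth-bound once the exponential is fast, trading a read pass for a little extra arithmetic in the fused pass is what yields the runtime reduction.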

Item: Diagnosing performance bottlenecks in HPC applications (Georgia Institute of Technology, 2019-03-29)
Czechowski, Kenneth ; Vuduc, Richard ; Chow, Edmond ; Kim, Hyesoon ; Lee, Victor W. ; Çatalyürek, Ümit V. ; Computational Science and Engineering

The software performance optimization process is one of the most challenging aspects of developing highly performant code because the underlying performance limitations are hard to diagnose. In many cases, identifying performance bottlenecks, such as latency stalls, requires a combination of fidelity and usability that existing tools do not provide: traditional performance models and runtime analysis lack the granularity necessary to uncover low-level bottlenecks, while architectural simulations are too cumbersome and fragile to employ as a primary source of information. To address this need, we propose a performance analysis technique, called Pressure Point Analysis (PPA), which delivers the accessibility of analytical models with the precision of a simulator. The foundation of this approach is an autotuning-inspired technique that dynamically perturbs binary code (e.g., inserting or deleting instructions to affect the utilization of functional units, altering memory access addresses to change the cache hit rate, or swapping registers to alter instruction-level dependencies) and then analyzes the effects the various perturbations have on overall performance. When systematically applied, a battery of carefully designed perturbations, each targeting specific microarchitectural features, can glean valuable insight into pressure points in the code. PPA provides actionable information about hardware-software interactions that software developers can use to manually tweak application code. In some circumstances the performance bottlenecks are unavoidable, in which case this analysis can be used to establish a rigorous performance bound for the application. In other cases, this information can identify the primary performance limitations and project the potential performance improvements if those bottlenecks were mitigated.
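A toy analogue of the perturbation idea, under my own assumptions, is sketched below. PPA itself rewrites binary code; here we only perturb a memory-access stride at the source level and watch how the timing responds, which is the same diagnostic logic at a much cruder granularity:

```python
import time
import numpy as np

def best_time(fn, reps=20):
    """Best-of-reps wall-clock time for a kernel (reduces timer noise)."""
    best = float('inf')
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

a = np.ones(1 << 24)                         # ~128 MB of float64
t1 = best_time(lambda: a[::1].sum())         # unperturbed baseline
for stride in (2, 4, 8, 16):
    ts = best_time(lambda s=stride: a[::s].sum())
    # A stride-s sweep touches 1/s of the elements but wastes cache-line
    # bandwidth; if ts * stride / t1 grows with stride, the kernel responds
    # to this "pressure point" and is memory-bound, not compute-bound.
    print(stride, ts * stride / t1)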

Item: Implementation and analysis of a parallel vertex-centered finite element segmental refinement multigrid solver (Georgia Institute of Technology, 2016-04-28)
Henneking, Stefan ; Vuduc, Richard ; Chow, Edmond ; Zhou, Hao-Min ; Adams, Mark F. ; Computational Science and Engineering

In a parallel vertex-centered finite element multigrid solver, segmental refinement can be used to avoid all inter-process communication on the fine grids. While domain decomposition methods generally require coupled subdomain processing for the numerical solution of a nonlinear elliptic boundary value problem, segmental refinement exploits the fact that subdomains are almost decoupled with respect to high-frequency error components. This makes it possible to perform multigrid with fully decoupled subdomains on the fine grids, which was proposed as a sequential low-storage algorithm by Brandt in the 1970s, and as a parallel algorithm by Brandt and Diskin in 1994. Adams published the first numerical results from a multilevel segmental refinement solver in 2014, confirming the asymptotic exactness of the scheme for a cell-centered finite volume implementation. We continue Brandt's and Adams' research by experimentally investigating the scheme's accuracy with a vertex-centered finite element segmental refinement solver. We confirm that full multigrid accuracy can be preserved for a few segmental refinement levels, although we observe a different dependency on the segmental refinement parameter space. We show that various strategies for the grid transfers between the finest conventional multigrid level and the segmental refinement subdomains affect the solver accuracy. Scaling results are reported for a Cray XC30 with up to 4,096 cores.
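For readers unfamiliar with the baseline scheme, here is a minimal 1D geometric multigrid V-cycle (my own illustration of conventional multigrid, not the thesis solver; segmental refinement modifies the fine-grid levels of exactly this kind of cycle so subdomains need not communicate):

```python
import numpy as np

def jacobi(u, f, h, nu, omega=2.0 / 3.0):
    """nu sweeps of weighted Jacobi for -u'' = f with zero Dirichlet boundaries."""
    u = u.copy()
    for _ in range(nu):
        u[1:-1] = ((1 - omega) * u[1:-1]
                   + omega * 0.5 * (u[:-2] + u[2:] + h * h * f[1:-1]))
    return u

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def v_cycle(u, f, h, nu=2):
    """One V-cycle on a grid of n = 2^k + 1 points."""
    u = jacobi(u, f, h, nu)                     # pre-smoothing
    if len(u) <= 3:
        return u                                # coarsest grid: smoothing suffices
    r = residual(u, f, h)
    ec = v_cycle(np.zeros(len(u) // 2 + 1), r[::2].copy(), 2 * h, nu)
    e = np.zeros_like(u)
    e[::2] = ec                                 # prolongation: copy coarse points...
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])          # ...and interpolate in between
    return jacobi(u + e, f, h, nu)              # post-smoothing

# Example: solve -u'' = pi^2 sin(pi x) on [0, 1], exact solution sin(pi x).
n = 129; h = 1.0 / (n - 1)
x = np.linspace(0.0, 1.0, n)
u = np.zeros(n)
for _ in range(10):
    u = v_cycle(u, np.pi**2 * np.sin(np.pi * x), h)
```

In the parallel setting, the restriction and prolongation steps are where subdomain coupling normally forces fine-grid communication, which is what segmental refinement avoids.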

Item: Scalable Data Mining via Constrained Low Rank Approximation (Georgia Institute of Technology, 2022-08-01)
Eswar, Srinivas ; Vuduc, Richard ; Park, Haesun ; Catalyurek, Umit ; Chow, Edmond ; Ballard, Grey ; Computational Science and Engineering

Matrix and tensor approximation methods are recognised as foundational tools for modern data analytics. Their strength lies in their long history of rigorous and principled theoretical foundations, judicious formulations via various constraints, and the availability of fast computer programs. Multiple Constrained Low Rank Approximation (CLRA) formulations exist for commonly encountered tasks like clustering, dimensionality reduction, and anomaly detection, amongst others. The primary challenge in modern data analytics is the sheer volume of data to be analysed, often requiring multiple machines just to hold the dataset in memory. This dissertation presents CLRA as a key enabler of scalable data mining on distributed-memory parallel machines.

Nonnegative Matrix Factorisation (NMF) is the primary CLRA method studied in this dissertation. NMF imposes nonnegativity constraints on the factor matrices and is a well-studied formulation known for its simplicity, interpretability, and clustering prowess. The major bottleneck in most NMF algorithms is a distributed matrix-multiplication kernel. We develop the Parallel Low rank Approximation with Nonnegativity Constraints (PLANC) software package, building on the earlier MPI-FAUN library, which includes an efficient matrix-multiplication kernel tailored to the CLRA case. It employs carefully designed parallel algorithms and data distributions to avoid unnecessary computation and communication. We extend PLANC to include several optimised Nonnegative Least-Squares (NLS) solvers and symmetric constraints, effectively employing the optimised matrix-multiplication kernel. We develop a parallel inexact Gauss-Newton algorithm for Symmetric Nonnegative Matrix Factorisation (SymNMF). In particular, PLANC is able to efficiently utilise second-order information when imposing symmetry constraints without incurring the prohibitive memory and computational costs usually associated with such methods, and we observe 70% efficiency while scaling them up.

We develop new parallel algorithms for fusing and analysing data with multiple modalities in the Joint Nonnegative Matrix Factorisation (JointNMF) context. JointNMF is capable of knowledge discovery when both feature-data and data-data information is present in a data source. We extend PLANC to handle this case of simultaneously approximating two different large input matrices and study the various trade-offs encountered in the bottleneck matrix-multiplication kernel. We show that these ideas translate naturally to the multilinear setting when data is presented in the form of a tensor, implementing the analogous bottleneck computation, the Matricised-Tensor Times Khatri-Rao Product (MTTKRP) kernel. We conclude by describing some avenues for future research which extend the work and ideas in this dissertation. In particular, we consider the notion of structured sparsity, where the user has some control over the nonzero pattern; this appears in computations for tasks like cross-validation, working with missing values, robust CLRA models, and the semi-supervised setting.
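As a small serial reference for the factorisation PLANC distributes, here is a sketch of the classic Lee-Seung multiplicative updates for NMF (my own illustration and an assumption; PLANC offers several other, faster NLS solvers). The W.T @ A and A @ H.T products are precisely the matrix-multiplication bottleneck the dissertation parallelises:

```python
import numpy as np

def nmf_mu(A, r, iters=200, eps=1e-9):
    """Lee-Seung multiplicative updates for A ~= W @ H with W, H >= 0."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    W, H = rng.random((m, r)), rng.random((r, n))
    for _ in range(iters):
        # Updates preserve nonnegativity and monotonically reduce ||A - WH||_F.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

A = np.random.rand(100, 80)
W, H = nmf_mu(A, r=10)
```

In the distributed setting, A is partitioned across processes, and the communication pattern of these two products dominates the cost, which motivates PLANC's tailored data distributions.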

Item: Automated surface finish inspection using convolutional neural networks (Georgia Institute of Technology, 2019-03-25)
Louhichi, Wafa ; Kurfess, Thomas R. ; Vuduc, Richard ; Saldana, Christopher ; Chau, Duen Horng ; Computational Science and Engineering

The surface finish of a machined part has an important effect on friction, wear, and aesthetics. Surface finish became a critical quality measure in the 1980s, mainly due to demands from the automotive industry. Visual inspection and quality control have traditionally been done by human experts. Normally, it takes a substantial amount of an operator's time to stop the process and compare the quality of the produced piece against a surface roughness gauge. This manual process does not guarantee consistent surface quality, is subject to human error, and depends on the subjective opinion of the expert. Current advances in image processing, computer vision, and machine learning have created a path toward automated surface finish inspection, increasing the automation level of the whole process even further than it is now.

In this thesis work, we propose a deep learning approach to replicate human judgment without using a surface roughness gauge. We used a Convolutional Neural Network (CNN) to train a surface finish classifier. Because of data scarcity, we generated our own image dataset of aluminum pieces produced from turning and boring operations on a Computer Numerical Control (CNC) lathe, consisting of a total of 980 training images, 160 validation images, and 140 test images. Considering the limited dataset and the computational cost of training deep neural networks from scratch, we applied transfer learning to models pre-trained on the publicly available ImageNet benchmark dataset. We used the PyTorch deep learning framework and both a CPU and a GPU to train a ResNet-18 CNN. Training on the CPU took 1h 21min 55s with a test accuracy of 97.14%, while training on the GPU took 1min 47s with a test accuracy of 97.86%. We used the Keras API, which runs on top of TensorFlow, to train a MobileNet model; training on Colaboratory's GPU took 1h 32min 14s with an accuracy of 98.57%. The deep CNN models provided surprisingly high accuracy, misclassifying only a few of the 140 test images. The MobileNet model allows inference to run efficiently on mobile devices. This affordable and easy-to-use solution provides a viable new method for automated surface inspection systems (ASIS).
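The transfer-learning recipe described above typically looks like the following PyTorch sketch. This is an illustration under assumptions, not the thesis code: the number of surface-finish classes and the choice to freeze the whole backbone are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained ResNet-18 (torchvision >= 0.13 weights API).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False                 # freeze the pretrained backbone

num_classes = 2                             # hypothetical: acceptable vs. rejected finish
model.fc = nn.Linear(model.fc.in_features, num_classes)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

def train_step(images, labels):
    """One optimization step over a batch from the machined-surface dataset."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Freezing the backbone and retraining only the final layer is what makes a dataset of under a thousand images workable, at the cost of some accuracy versus fine-tuning deeper layers.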

Item: Voxel-based offsetting at high resolution with tunable speed and precision using hybrid dynamic trees (Georgia Institute of Technology, 2016-11-11)
Hossain, Mohammad Moazzem ; Vuduc, Richard ; Kurfess, Thomas R. ; Rossignac, Jarek ; Young, Jeffrey ; Tucker, Thomas ; Computational Science and Engineering

In recent years, digital manufacturing has experienced a wave of rapid prototyping through the innovation and ubiquity of 3D printing technology. While such advancement liberates the constraints of shape selection in physical objects, 3D printing has yet to match the precision, robustness, and broad applicability offered by the classical subtractive manufacturing process. To simplify toolpath planning in conventional multi-axis CNC machining, recent research has proposed adopting voxel-based geometric modeling. Inherently, a voxel representation is amenable to parallel acceleration on modern, ubiquitous GPU hardware. While there are many possible approaches to representing voxels, this work is based on a novel voxel data structure called the hybrid dynamic tree (HDT), which combines a dense grid and a sparse octree in a way that makes it both more compact (i.e., storage-efficient) and better suited to GPUs (i.e., computation-effective) than state-of-the-art alternatives. This dissertation contributes the following four aspects.

First, we present a parallel method to construct the HDT representation on the GPU for a CAD input modeled as a triangle mesh. In addition, to optimize the memory footprint of the HDT, our research develops a theoretical storage analysis for different active-node branchings in the octree. We thus incorporate tunability into the HDT organization to study the complexity of the memory footprint. The theoretical storage analysis is validated with rigorous experimentation, which helps devise optimal parameter selections for a storage-compact HDT representation.

Next, the thesis presents a mathematical-morphology-based offsetting algorithm using the HDT voxel representation. At our target resolution of 4096 x 4096 x 4096, our goals are to compute large-scale offsets in minutes, match or beat the number of bits of the representation compared to state-of-the-art alternatives, and experimentally characterize any trade-offs among speed, storage, and precision. While using the HDT as the underlying data structure leads naturally to a storage-efficient representation, the challenge in developing a high-performance implementation of the offset algorithm is choosing an optimal configuration of the HDT parameters. These parameters not only govern the memory footprint of the voxelized representation of the solid but also control the efficiency of parallel code execution on the GPU's computing units. The ability to fine-tune the data structure is crucial for understanding, and thereby optimizing, this computation-intensive algorithm. Toward that end, the thesis explores different practical approaches to achieving high-performance voxel offsetting. We first study the impact of different HDT configurations on voxel offsetting. Then, to devise fast voxel offsetting, we analyze the trade-offs between speed and accuracy through a controllable size of the morphological filter, studying the impact of decomposing a large offset distance into a series of offsets with smaller distances. To facilitate this trade-off analysis, we implement a GPU-accelerated error measurement technique.

Finally, to enable even faster voxel offsetting, we present the principles of offloading the offset computation in the HDT across a cluster of GPUs co-hosted on the same computing node. Our research studies the impact of different approaches for CUDA kernel execution controlled through either a single CPU thread or multiple independent CPU threads. In addition, we examine different load distribution policies that account for the computational disparity among the deployed GPUs. With more and more GPUs integrated on a single computing node, this exploration of algorithmic speedup through a load-balanced implementation of voxel offsetting across multiple GPUs underscores the high scalability of the HDT's hybrid voxel representation.
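The core morphological operation is easy to state in a dense setting. The following SciPy sketch (my own dense-array illustration and an assumption; the thesis performs this on the HDT, on GPUs, at 4096^3) offsets a voxelized solid by dilating it with a spherical structuring element:

```python
import numpy as np
from scipy import ndimage

def voxel_offset(solid, radius):
    """Offset a binary voxel solid outward by `radius` voxels via dilation
    with a discretized ball (erosion would give an inward offset)."""
    r = int(radius)
    zyx = np.mgrid[-r:r + 1, -r:r + 1, -r:r + 1]
    ball = (zyx ** 2).sum(axis=0) <= r * r      # spherical morphological filter
    return ndimage.binary_dilation(solid, structure=ball)

solid = np.zeros((64, 64, 64), dtype=bool)
solid[24:40, 24:40, 24:40] = True               # a small test cube
offset = voxel_offset(solid, 5)
```

Dilating twice with radius r/2 approximates one dilation with radius r using a much smaller filter, which is the speed-versus-accuracy decomposition trade-off the thesis analyzes.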

Item: Scalable and resilient sparse linear solvers (Georgia Institute of Technology, 2018-05-22)
Sao, Piyush Kumar ; Vuduc, Richard ; Li, Xiaoye S. ; Park, Haesun ; Chow, Edmond ; Zhou, Hao-Min ; Catalyurek, Umit ; Computational Science and Engineering

Solving a large, sparse system of linear equations is a ubiquitous problem in scientific computing. The challenges in scaling such solvers on current and future parallel computer systems are the high cost of communication and the expected decrease in reliability of hardware components. This dissertation contributes new techniques to address these issues.

Regarding communication, we make two advances to reduce both on-node and inter-node communication of distributed-memory sparse direct solvers. On-node, we propose a novel technique, called HALO, targeted at heterogeneous architectures consisting of multicore processors and hardware accelerators such as GPUs or Xeon Phi. The name HALO is shorthand for highly asynchronous lazy offload, which refers to the way the method combines highly aggressive use of asynchrony with accelerated offload, lazy updates, and data shadowing (a la halo or ghost zones), all of which serve to hide and reduce communication, whether to local memory, across the network, or over PCIe. The overall hybrid solver achieves speedups of up to 3x on a variety of realistic test problems in single- and multi-node configurations. To reduce inter-node communication, we present a novel communication-avoiding 3D sparse LU factorization algorithm. It uses a three-dimensional logical arrangement of MPI processes and combines data redundancy with so-called elimination-tree parallelism to reduce communication. The 3D algorithm reduces the asymptotic communication costs by a factor of $O(\sqrt{\log n})$ and the latency costs by a factor of $O(\log n)$ for planar sparse matrices arising from the finite element discretization of two-dimensional PDEs. For non-planar sparse matrices, it reduces the communication and latency costs by a constant factor.

Beyond performance, we consider methods to improve solver resilience. In emerging and future systems with billions of computing elements, hardware faults during execution may become the norm rather than the exception. We illustrate the principle of self-stabilization for constructing fault-tolerant iterative linear solvers, giving two proof-of-concept examples: one for steepest descent (SD) and one for conjugate gradients (CG). Our self-stabilized versions of SD and CG require only small amounts of fault detection; e.g., we may check only for NaNs and infinities. We test our approach experimentally by analyzing its convergence and overhead for different types and rates of faults.
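The detect-and-correct pattern for the SD case might look like the following NumPy sketch. This is a minimal illustration under my own assumptions about the reset policy; the dissertation gives the precise self-stabilizing formulations for SD and CG:

```python
import numpy as np

def self_stabilizing_sd(A, b, x, iters=1000):
    """Steepest descent for symmetric positive definite A with cheap
    fault detection: an iterate corrupted by a transient fault (NaN/Inf)
    is reset to a known-valid state rather than trusted, so faults
    cannot permanently derail convergence."""
    for _ in range(iters):
        if not np.all(np.isfinite(x)):
            x = np.zeros_like(b)            # correction: fall back to a valid state
        r = b - A @ x                       # residual recomputed from x itself
        denom = r @ (A @ r)
        if denom == 0.0:                    # zero residual: converged exactly
            return x
        x = x + (r @ r) / denom * r         # exact line search along r
    return x

# Example on a small SPD system.
n = 50
M = np.random.rand(n, n)
A = M @ M.T + n * np.eye(n)
b = np.random.rand(n)
x = self_stabilizing_sd(A, b, np.zeros(n))
```

Because each step rebuilds the residual from the current iterate instead of a running recurrence, any finite corruption of x is eventually smoothed out by subsequent iterations, which is the essence of the self-stabilization argument.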