Organizational Unit:
School of Computational Science and Engineering

Research Organization Registry ID
Description
Previous Names
Parent Organization
Parent Organization
Organizational Unit
Includes Organization(s)

Publication Search Results

Now showing 1 - 3 of 3
  • Item
    Voxel-based offsetting at high resolution with tunable speed and precision using hybrid dynamic trees
    (Georgia Institute of Technology, 2016-11-11) Hossain, Mohammad Moazzem
    In the recent years, digital manufacturing has experienced the wave of rapid prototyping through the innovation and ubiquity in 3D printing technology. While such advancement liberates the constraints of shape selection in physical objects, 3D printing is yet to match the precision, robustness and vast applicability offered by the classical subtractive manufacturing process. To simplify the toolpath planning in conventional multi-axis CNC machining, recent researches have proposed adopting a voxel-based geometric modeling. Inherently, a voxel representation is amenable for parallel acceleration on modern ubiquitous GPU hardware. While there can be many different approaches to represent voxels, this work is based on a novel voxel data structure called hybrid dynamic tree (HDT) that combines dense grid and sparse octree in such a way that makes it both more compact (i.e., storage efficient) and better-suited to GPUs (i.e., computation effective) than state-of-the-art alternatives. This dissertation contributes in the following four aspects: First, we present a parallel method to construct the HDT representation on GPU for a CAD input modeled in a triangle mesh. In addition, to optimize the memory footprint in the HDT our research explores the theoretical storage analysis for different active node branchings in the Octree. Thus, we incorporate tunability into the HDT organization to study the complexity of memory footprint. The developed theoretical storage analysis is validated with rigorous experimentation that helps devising optimal parameter selections for storage-compact HDT representation. Next, the thesis presents a mathematical morphology based offsetting algorithm using the HDT voxel representation. At our target resolution of 4096 x 4096 x 4096, our goal is to compute large-scale offsets in minutes, match or beat the number of bits of the representation compared to state-of-the-art alternatives, and experimentally characterize any trade-offs among speed, storage, and precision. While using the HDT as the underlying data structure leads naturally to a storage-efficient representation, the challenge in developing a high-performance implementation of offset algorithm is choosing an optimal configuration of the HDT parameters.These parameters not only govern the memory footprint of the voxelized representation of the solid, but also control the parallel code execution efficiency on parallel computing units on GPU. Capability of fine-tuning of a data structure is crucial for understanding, and thereby optimizing, the developed computation-intensive algorithm that uses the HDT as the underlying voxel representation. Towards that end, this thesis explores different practical approaches to achieve high-performance voxel offsetting. First, we study the impact of the different HDT configurations on the voxel offsetting. Next, to devise a fast voxel offsetting we analyze the trade-offs between speed and accuracy through controllable size of the morphological filter. We study the impact of the decomposition of a large offset distance into a series of offsetting with smaller distances. To facilitate this trade-off analysis, we implement a GPU-accelerated error measurement technique. Finally, to enable even faster voxel offsetting, we present the principles of offloading the offset computation in the HDTs across a cluster of GPUs co-hosted on the same computing node. Our research studies the impact of different approaches for CUDA kernel execution controlled through either single or multiple independent CPU threads. In addition, we examine different load distribution policies that consider the computational disparity in the deployed GPUs. With more and more GPUs integrated on a single computing node, such exploration of algorithmic speedup through load-balanced implementation of voxel offsetting across multiple GPUs emphasizes the high scalability of the HDT's hybrid voxel representation.
  • Item
    Implementation and analysis of a parallel vertex-centered finite element segmental refinement multigrid solver
    (Georgia Institute of Technology, 2016-04-28) Henneking, Stefan
    In a parallel vertex-centered finite element multigrid solver, segmental refinement can be used to avoid all inter-process communication on the fine grids. While domain decomposition methods generally require coupled subdomain processing for the numerical solution to a nonlinear elliptic boundary value problem, segmental refinement exploits that subdomains are almost decoupled with respect to high-frequency error components. This allows to perform multigrid with fully decoupled subdomains on the fine grids, which was proposed as a sequential low-storage algorithm by Brandt in the 1970s, and as a parallel algorithm by Brandt and Diskin in 1994. Adams published the first numerical results from a multilevel segmental refinement solver in 2014, confirming the asymptotic exactness of the scheme for a cell-centered finite volume implementation. We continue Brandt’s and Adams’ research by experimentally investigating the scheme’s accuracy with a vertex-centered finite element segmental refinement solver. We confirm that full multigrid accuracy can be preserved for a few segmental refinement levels, although we observe a different dependency on the segmental refinement parameter space. We show that various strategies for the grid transfers between the finest conventional multigrid level and the segmental refinement subdomains affect the solver accuracy. Scaling results are reported for a Cray XC30 with up to 4096 cores.
  • Item
    The fast multipole method at exascale
    (Georgia Institute of Technology, 2013-11-26) Chandramowlishwaran, Aparna
    This thesis presents a top to bottom analysis on designing and implementing fast algorithms for current and future systems. We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N- body problems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme scale systems. We specifically address two key challenges. The first challenge is how to engineer fast code for today’s platforms. We present the first in-depth study of multicore op- timizations and tuning for FMM, along with a systematic approach for transforming a conventionally-parallelized FMM into a highly-tuned one. We introduce novel opti- mizations that significantly improve the within-node scalability of the FMM, thereby enabling high-performance in the face of multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter- node communication costs. Using these models, we present results for choosing the optimal algorithmic tuning parameter. This analysis also yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs, if there are no significant changes could cause it to become communication-bound as early as the year 2015. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of highlevel algorithm-architecture co-design. To demonstrate the scientific significance of FMM, we present two applications namely, direct simulation of blood which is a multi-scale multi-physics problem and large-scale biomolecular electrostatics. MoBo (Moving Boundaries) is the infrastruc- ture for the direct numerical simulation of blood. It comprises of two key algorithmic components of which FMM is one. We were able to simulate blood flow using Stoke- sian dynamics on 200,000 cores of Jaguar, a peta-flop system and achieve a sustained performance of 0.7 Petaflop/s. The second application we propose as future work in this thesis is biomolecular electrostatics where we solve for the electrical potential using the boundary-integral formulation discretized with boundary element methods (BEM). The computational kernel in solving the large linear system is dense matrix vector multiply which we propose can be calculated using our scalable FMM. We propose to begin with the two dielectric problem where the electrostatic field is cal- culated using two continuum dielectric medium, the solvent and the molecule. This is only a first step to solving biologically challenging problems which have more than two dielectric medium, ion-exclusion layers, and solvent filled cavities. Finally, given the difficulty in producing high-performance scalable code, productivity is a key concern. Recently, numerical algorithms are being redesigned to take advantage of the architectural features of emerging multicore processors. These new classes of algorithms express fine-grained asynchronous parallelism and hence reduce the cost of synchronization. We performed the first extensive performance study of a recently proposed parallel programming model, called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially-ordered by semantic scheduling constraints. The CnC model is well-suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art mul- ticore systems. Our implementations in CnC was able to match and in some cases even exceed competing vendor-tuned and domain specific library codes. We combine these two distinct research efforts by expressing FMM in CnC, our approach tries to marry performance with productivity that will be critical on future systems. Looking forward, we would like to extend this to distributed memory machines, specifically implement FMM in the new distributed CnC, distCnC to express fine-grained paral- lelism which would require significant effort in alternative models.