Organizational Unit:
School of Computational Science and Engineering

Publication Search Results

Now showing 1 - 10 of 185
  • Item
    Sentiment Search: Make the Internet your Focus Group
    (Georgia Institute of Technology, 2022-12) Durrani, Faris ; Ahmed, Nemeth ; Zandstra, Justin ; Wang, RenChu ; Lakshmanan, Lakshmi Sree ; Lin, Shuyan
    Sentiment Analysis has been used to identify the changing moods of populations. In this project, we analysed how sentiments around topics from significant world events evolve across social media platforms (Facebook, Reddit, Twitter) and news sources (CNN, The New York Times, The Guardian). We created an interactive visualization tool that allows users to filter data by specific keywords and dates and to visualize time-series sentiments along with the top words used in posts. This dashboard could potentially be used by businesses or political campaigns to analyze the effect of marketing strategies on public sentiment regarding their product, or to analyze the social climate surrounding certain ideas and issues on multiple platforms. Future steps could include dynamic rendering of sentiments as new media posts arrive, using faster, more efficient algorithms.
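    A minimal sketch (not the project's implementation) of the kind of view such a dashboard provides: hypothetical post records are filtered by keyword and date range, and per-post sentiment scores are averaged into a daily time series. The post data and scores below are invented for illustration.
```python
from collections import defaultdict
from datetime import date

# Hypothetical post records: (date, platform, text, sentiment score in [-1, 1]).
posts = [
    (date(2022, 3, 1), "Twitter", "markets rally on good news", 0.6),
    (date(2022, 3, 1), "Reddit",  "markets slide amid uncertainty", -0.4),
    (date(2022, 3, 2), "CNN",     "markets flat as investors wait", 0.0),
]

def daily_sentiment(posts, keyword, start, end):
    """Average sentiment per day for posts in [start, end] containing `keyword`."""
    totals, counts = defaultdict(float), defaultdict(int)
    for day, _platform, text, score in posts:
        if start <= day <= end and keyword.lower() in text.lower():
            totals[day] += score
            counts[day] += 1
    return {day: totals[day] / counts[day] for day in sorted(totals)}

print(daily_sentiment(posts, "markets", date(2022, 3, 1), date(2022, 3, 2)))
```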
  • Item
    Data Tiling for Sparse Computation
    (Georgia Institute of Technology, 2022-11-11) An, Xiaojing
    Many real-world data contain internal relationships. Efficient analysis of these relationship data is crucial for important problems including genome alignment, network vulnerability analysis, ranking web pages, among others. Such relationship data is frequently sparse and analysis on it is called sparse computation. We demonstrate that the important technique of data tiling is more powerful than previously known by broadening its application space. We focus on three important sparse computation areas: graph analysis, linear algebra, and bioinformatics. We demonstrate data tiling's power by addressing key issues and providing significant improvements---to both runtime and solution quality---in each area. For graph analysis, we focus on fast data tiling techniques that can produce well-structured tiles and demonstrate theoretical hardness results. These tiles are suitable for graph problems as they reduce data movement and ultimately improve end-to-end runtime performance. For linear algebra, we introduce a new cache-aware tiling technique and apply it to the key kernel of sparse matrix by sparse matrix multiplication. This technique tiles the second input matrix and then uses a small, summary matrix to guide access to the tiles during computation. Our approach results in the fastest known implementation across three distinct CPU architectures. In bioinformatics, we develop a tiling based de novo genome assembly pipeline. We start with reads and develop either a graph or hypergraph that captures internal relationships between reads. This is then tiled to minimize connections while maintaining balance. We then treat each resulting tile independently as the input to an existing, shared-memory assembler. Our pipeline improves existing state-of-the-art de novo genome assemblers and brings both runtime and quality improvements to them on both real-world and simulated datasets.
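    A toy illustration of the column-tiling idea behind the sparse matrix by sparse matrix multiplication work described above (a SciPy sketch, not the thesis implementation): the second operand is split into column tiles so each partial product touches only a bounded slice of it, and a trivial summary check skips tiles with no nonzeros.
```python
import numpy as np
import scipy.sparse as sp

def tiled_spgemm(A, B, tile_cols=256):
    """Compute A @ B one column tile of B at a time."""
    tiles = []
    for start in range(0, B.shape[1], tile_cols):
        B_tile = B[:, start:start + tile_cols]
        if B_tile.nnz == 0:
            # Summary check: an empty tile contributes an all-zero block.
            tiles.append(sp.csr_matrix((A.shape[0], B_tile.shape[1])))
        else:
            tiles.append(A @ B_tile)
    return sp.hstack(tiles, format="csr")

A = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
B = sp.random(1000, 1000, density=0.01, format="csr", random_state=1)
diff = tiled_spgemm(A, B) - (A @ B)
print("max abs difference vs. untiled product:", np.abs(diff.toarray()).max())
```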
  • Item
    Scalable Data Mining via Constrained Low Rank Approximation
    (Georgia Institute of Technology, 2022-08-01) Eswar, Srinivas
    Matrix and tensor approximation methods are recognised as foundational tools for modern data analytics. Their strength lies in their long history of rigorous and principled theoretical foundations, judicious formulations via various constraints, along with the availability of fast computer programs. Multiple Constrained Low Rank Approximation (CLRA) formulations exist for various commonly encountered tasks like clustering, dimensionality reduction, anomaly detection, amongst others. The primary challenge in modern data analytics is the sheer volume of data to be analysed, often requiring multiple machines to just hold the dataset in memory. This dissertation presents CLRA as a key enabler of scalable data mining in distributed-memory parallel machines. Nonnegative Matrix Factorisation (NMF) is the primary CLRA method studied in this dissertation. NMF imposes nonnegativity constraints on the factor matrices and is a well studied formulation known for its simplicity, interpretability, and clustering prowess. The major bottleneck in most NMF algorithms is a distributed matrix-multiplication kernel. We develop the Parallel Low rank Approximation with Nonnegativity Constraints (PLANC) software package, building on the earlier MPI-FAUN library, which includes an efficient matrix-multiplication kernel tailored to the CLRA case. It employs carefully designed parallel algorithms and data distributions to avoid unnecessary computation and communication. We extend PLANC to include several optimised Nonnegative Least-Squares (NLS) solvers and symmetric constraints, effectively employing the optimised matrix-multiplication kernel. We develop a parallel inexact Gauss-Newton algorithm for Symmetric Nonnegative Matrix Factorisation (SymNMF). In particular PLANC is able to efficiently utilise second-order information when imposing symmetry constraints without incurring the prohibitive memory and computational costs associated with these methods. We are able to observe 70% efficiency while scaling up these methods. We develop new parallel algorithms for fusing and analysing data with multiple modalities in the Joint Nonnegative Matrix Factorisation (JointNMF) context. JointNMF is capable of knowledge discovery when both feature-data and data-data information is present in a data source. We extend PLANC to handle this case of simultaneously approximating two different large input matrices and study the various trade-offs encountered in the bottleneck matrix-multiplication kernel. We show that these ideas translate naturally to the multilinear setting when data is presented in the form of a tensor. A bottleneck computation analogous to the matrix-multiply, the Matricised-Tensor Times Khatri-Rao Product (MTTKRP) kernel, is implemented. We conclude by describing some avenues for future research which extend the work and ideas in this dissertation. In particular, we consider the notion of structured sparsity, where the user has some control over the nonzero pattern, which appears in computations for various tasks like cross-validation, working with missing values, robust CLRA models, and in the semi-supervised setting.
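    A serial NumPy sketch of the NMF formulation discussed above (not PLANC itself): minimize ||A - WH||_F with W, H >= 0 via multiplicative updates. The products W^T A and (W^T W) H are the matrix-multiplication bottleneck that PLANC's distributed kernel is designed to accelerate.
```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9, seed=0):
    """Rank-k NMF of a nonnegative matrix A via Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)    # update H; stays nonnegative
        W *= (A @ H.T) / (W @ (H @ H.T) + eps)  # update W; stays nonnegative
    return W, H

A = np.random.default_rng(1).random((100, 80))  # toy nonnegative data matrix
W, H = nmf(A, k=5)
print("relative error:", np.linalg.norm(A - W @ H) / np.linalg.norm(A))
```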
  • Item
    Efficient methods for read mapping
    (Georgia Institute of Technology, 2022-08-01) Zhang, Haowen
    DNA sequencing is the mainstay of biological and medical research. Modern sequencing machines can read millions of DNA fragments, sampling the underlying genomes at high-throughput. Mapping the resulting reads to a reference genome is typically the first step in sequencing data analysis. The problem has many variants as the reads can be short or long with a low or high error rate for different sequencing technologies, and the reference can be a single genome or a graph representation of multiple genomes. Therefore, it is crucial to develop efficient computational methods for these different problem classes. Moreover, continually declining sequencing costs and increasing throughput pose challenges to the previously developed methods and tools that cannot handle the growing volume of sequencing data. This dissertation seeks to advance the state-of-the-art in the established field of read mapping by proposing more efficient and scalable read mapping methods as well as tackling emerging new problem areas. Specifically, we design ultra-fast methods to map two types of reads: short reads for high-throughput chromatin profiling and nanopore raw reads for targeted sequencing in real-time. In tune with the characteristics of these types of reads, our methods can scale to larger sequencing data sets or map more reads correctly compared with the state-of-the-art mapping software. Furthermore, we propose two algorithms for aligning sequences to graphs, which is the foundation of mapping reads to graph-based reference genomes. One algorithm improves the time complexity of existing sequence to graph alignment algorithms for linear or affine gap penalty. The other algorithm provides good empirical performance in the case of the edit distance metric. Finally, we mathematically formulate the problem of validating paired-end read constraints when mapping sequences to graphs, and propose an exact algorithm that is also fast enough for practical use.
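    A toy seed-and-vote mapper illustrating the general seeding idea behind read mapping (a minimal sketch, not the thesis methods): index all reference k-mers, look up each read k-mer, and report the reference offset supported by the most seeds.
```python
from collections import defaultdict

def index_kmers(ref, k):
    """Hash index from each k-mer to its start positions in the reference."""
    idx = defaultdict(list)
    for i in range(len(ref) - k + 1):
        idx[ref[i:i + k]].append(i)
    return idx

def map_read(read, idx, k):
    """Vote for candidate read start positions implied by matching k-mer seeds."""
    votes = defaultdict(int)
    for j in range(len(read) - k + 1):
        for pos in idx.get(read[j:j + k], ()):
            votes[pos - j] += 1   # implied start of the read on the reference
    return max(votes, key=votes.get) if votes else None

ref = "ACGTACGTTTGACCAGTACGGATTACACGT"
idx = index_kmers(ref, k=5)
print(map_read("GACCAGTACG", idx, k=5))   # expect 10 (0-based start in ref)
```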
  • Item
    Robust Reservoir Computing Approaches for Predicting Cardiac Electrical Dynamics
    (Georgia Institute of Technology, 2022-07-29) Shahi, Shahrokh
    Computational modeling of cardiac electrophysiological signaling is of vital importance in understanding, preventing, and treating life-threatening arrhythmias. Traditionally, mathematical models incorporating physical principles have been used to study cardiac dynamical systems and can generate mechanistic insights, but their predictions are often quantitatively inaccurate due to model complexity, the lack of observability in the system, and variability within individuals and across the population. In contrast, machine-learning techniques can learn directly from training data, which in this context are time series of observed state variables, without prior knowledge of the system dynamics. The reservoir computing framework, a learning paradigm derived from recurrent neural network concepts and most commonly realized as an echo state network (ESN), offers a streamlined training process and holds promise to deliver more accurate predictions than mechanistic models. Accordingly, this research aims to develop robust ESN-based forecasting approaches for nonlinear cardiac electrodynamics, and thus presents the first application of machine-learning, and deep-learning techniques in particular, for modeling the complex electrical dynamics of cardiac cells and tissue. To accomplish this goal, we completed a set of three projects. (i) We compared the performance of available mainstream techniques for prediction with that of the baseline ESN approach along with several new ESN variants we proposed, including a physics-informed hybrid ESN. (ii) We proposed a novel integrated approach, the autoencoder echo state network (AE-ESN), that can accurately forecast the long-term future dynamics of cardiac electrical activity. This technique takes advantage of the best characteristics of both gated recurrent neural networks and ESNs by integrating a long short-term memory (LSTM) autoencoder into the ESN framework to improve reliability and robustness. (iii) We extended the long-term prediction of cardiac electrodynamics from a single cardiac cell to the tissue level, where, in addition to the temporal information, the data includes spatial dimensions and diffusive coupling. Building on the main design idea of the AE-ESN, a convolutional autoencoder was equipped with an ESN to create the Conv-ESN technique, which can process the spatiotemporal data and effectively capture the temporal dependencies between samples of data. Using these techniques, we forecast cardiac electrodynamics for a variety of datasets obtained in both in silico and in vitro experiments. We found that the proposed integrated approaches provide robust and computationally efficient techniques that can successfully predict the dynamics of electrical activity in cardiac cells and tissue with higher prediction accuracy than mainstream deep-learning approaches commonly used for predicting temporal data. On the application side, our approaches provide accurate forecasts over clinically useful time periods that could allow prediction of electrical problems with sufficient time for intervention and thus may support new types of treatments for some kinds of heart disease.
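    A minimal NumPy echo state network sketch (a generic ESN, not the AE-ESN or Conv-ESN developed in the thesis): a fixed random reservoir is driven by the input series, and only the linear readout is fit, here by ridge regression on a toy one-step-ahead forecasting task.
```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 60, 3000)
u = np.sin(t) + 0.5 * np.sin(2.3 * t)          # toy quasi-periodic signal

n_res, leak, ridge = 300, 0.3, 1e-6
W_in = rng.uniform(-0.5, 0.5, (n_res, 1))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius below 1

def run_reservoir(series):
    """Collect reservoir states while driving the network with the series."""
    x = np.zeros(n_res)
    states = []
    for val in series:
        x = (1 - leak) * x + leak * np.tanh(W_in @ [val] + W @ x)
        states.append(x.copy())
    return np.array(states)

X = run_reservoir(u[:-1])                       # states predict the next sample
Y = u[1:]
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ Y)
pred = X @ W_out
print("one-step-ahead RMSE:", np.sqrt(np.mean((pred - Y) ** 2)))
```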
  • Item
    Deep generative models for solving geophysical inverse problems
    (Georgia Institute of Technology, 2022-07-19) Siahkoohi, Ali
    My thesis presents several novel methods to facilitate solving large-scale inverse problems by utilizing recent advances in machine learning, particularly deep generative modeling. Inverse problems involve reliably estimating unknown parameters of a physical model from indirect, noisy observed data. Solving inverse problems presents two main challenges. The first is to capture and incorporate prior knowledge into ill-posed inverse problems whose solutions cannot be uniquely identified. The second is the computational complexity of solving inverse problems, particularly the cost of quantifying uncertainty. The main goal of this thesis is to address these issues by developing practical data-driven methods that are scalable to geophysical applications in which access to high-quality training data is often limited. There are six papers included in this thesis. A majority of these papers focus on addressing the computational challenges associated with Bayesian inference and uncertainty quantification, while others focus on developing regularization techniques to improve inverse problem solution quality and accelerate the solution process. These papers demonstrate the applicability of the proposed methods to seismic imaging, a large-scale geophysical inverse problem with a computationally expensive forward operator for which sufficiently capturing the variability in the Earth's heterogeneous subsurface through a training dataset is challenging. The first two papers present computationally feasible methods for applying a class of methods commonly referred to as deep priors to seismic imaging and uncertainty quantification. I also present a systematic Bayesian approach to translate uncertainty in seismic imaging into uncertainty in downstream tasks performed on the image. The next two papers aim to address the reliability concerns surrounding data-driven methods for solving Bayesian inverse problems by leveraging variational inference formulations that offer the benefits of fully learned posteriors while being directly informed by physics and data. The last two papers are concerned with correcting forward-modeling errors: the first proposes an adversarially learned postprocessing step to attenuate numerical dispersion artifacts in wave-equation simulations due to coarse finite-difference discretizations, while the second trains a Fourier neural operator as a surrogate forward model in order to accelerate the quantification of uncertainty due to errors in the forward model parameterization.
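    A toy linear-Gaussian analogue (not the seismic-imaging methods of the thesis) of the two ingredients above: a prior regularizes an ill-posed, underdetermined problem, and the posterior covariance quantifies the remaining uncertainty. The forward operator and noise model below are invented for illustration.
```python
import numpy as np

rng = np.random.default_rng(0)
n, m_obs, sigma = 50, 20, 0.05
G = rng.normal(size=(m_obs, n))                 # underdetermined forward operator
m_true = np.sin(np.linspace(0, 3 * np.pi, n))   # unknown model parameters
d = G @ m_true + sigma * rng.normal(size=m_obs) # noisy indirect observations

prior_var = 1.0
# Linear-Gaussian posterior: precision = G^T G / sigma^2 + I / prior_var.
precision = G.T @ G / sigma**2 + np.eye(n) / prior_var
cov_post = np.linalg.inv(precision)
m_post = cov_post @ (G.T @ d / sigma**2)        # posterior mean (MAP estimate)

print("relative error of posterior mean:",
      np.linalg.norm(m_post - m_true) / np.linalg.norm(m_true))
print("mean posterior std (uncertainty):", np.sqrt(np.diag(cov_post)).mean())
```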
  • Item
    High-Performance Software for Quantum Chemistry and Hierarchical Matrices
    (Georgia Institute of Technology, 2022-05-20) Erlandson, Lucas Alden
    Linear algebra is the underpinning of a significant portion of the computation done in the modern age. Applications relying on linear algebra include physical and chemical simulations, machine learning, artificial intelligence, optimization, partial differential equations, and many more. However, the direct use of mathematically exact linear algebra is often infeasible for the large problems of today. Numerical and iterative methods provide a way of solving the underlying problems only to the required accuracy, allowing problems that are many orders of magnitude larger to be solved orders of magnitude more quickly than if they were solved using exact linear algebra. In this dissertation, we discuss and test existing methods and develop new high-performance numerical methods for scientific computing kernels, including matrix multiplications, linear solves, and eigensolves, which accelerate applications including Gaussian processes and quantum chemistry simulations. Notably, we use preconditioned hierarchical matrices for the hyperparameter optimization and prediction phases of Gaussian process regression, develop a sparse triple matrix product on GPUs, and investigate 3D matrix-matrix multiplications for Chebyshev-filtered subspace iteration in Kohn-Sham density functional theory calculations. Exploiting the structural sparsity of many practical scientific problems can achieve a significant speedup over dense formulations of the same problems. Even so, many problems cannot be accurately represented or approximated in a structurally sparse manner. Many of these problems, such as kernels arising from machine learning and the electron repulsion integral (ERI) matrices from electronic structure computations, can be accurately represented in data-sparse structures, which allow for rapid calculations. We investigate hierarchical matrices, which provide a data-sparse representation of kernel matrices. In particular, our SMASH approximation can be constructed and provides matrix multiplications in near-linear time, which can then be used in matrix-free methods to find the optimal hyperparameters for Gaussian processes and to perform prediction asymptotically more rapidly than direct methods. To further accelerate the use of hierarchical matrices, we provide a data-driven approach (which considers the distribution of the data points associated with a kernel matrix) that reduces a given problem's memory and computation requirements. Furthermore, we investigate the use of preconditioning in Gaussian process regression: we use matrix-free algorithms for the hyperparameter optimization and prediction phases of Gaussian process regression. This provides a framework for Gaussian process regression that scales to large-scale problems and is asymptotically faster than state-of-the-art methods. We also provide an exploration and analysis of the conditioning and numerical issues that arise from the near-rank-deficient matrices that occur during hyperparameter optimization. Density Functional Theory (DFT) is a valuable method for electronic structure calculations in simulating quantum chemical systems due to its high accuracy-to-cost ratio. However, even with the computational power of modern computers, the O(n^3) complexity of the eigensolves and other kernels mandates that new methods be developed to allow larger problems to be solved. Two promising approaches for tackling these problems are using modern architectures (including state-of-the-art accelerators and multicore systems) and 3D matrix-multiplication algorithms. We investigate these methods to determine whether they result in an overall speedup and, using these kernels, we provide a high-performance framework for Chebyshev-filtered subspace iteration. GPUs are a family of accelerators that provide immense computational power but must be used correctly to achieve good efficiency. In algebraic multigrid, there arises a sparse triple matrix product which, due to its sparse (and relatively unstructured) nature, is challenging to perform efficiently on GPUs and is typically computed as two successive matrix-matrix products. However, by performing a single triple-matrix product, it may be possible to reduce the overhead associated with sparse matrix-matrix products on the GPU. We develop a sparse triple-matrix product that reduces the computation time required for several classes of problems.
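    A dense NumPy sketch of Chebyshev-filtered subspace iteration, the kernel named above (serial toy code, not the 3D-parallel or GPU implementation; the spectral bounds here come from an exact eigendecomposition purely for illustration): a Chebyshev polynomial in H damps the unwanted upper spectrum and amplifies the lowest eigenvectors, followed by orthonormalization and a Rayleigh-Ritz projection.
```python
import numpy as np

def cheb_filter(H, X, lb, ub, degree):
    """Apply a Chebyshev polynomial in H that damps eigenvalues in [lb, ub]."""
    e, c = (ub - lb) / 2.0, (ub + lb) / 2.0
    Y = (H @ X - c * X) / e
    for _ in range(2, degree + 1):
        Y_new = 2.0 * (H @ Y - c * Y) / e - X   # three-term Chebyshev recurrence
        X, Y = Y, Y_new
    return Y

rng = np.random.default_rng(0)
n, n_states = 400, 10
wanted = np.linspace(-10.0, -5.0, n_states)            # well-separated low states
rest = rng.uniform(0.0, 1.0, n - n_states)             # unwanted upper spectrum
Q0, _ = np.linalg.qr(rng.normal(size=(n, n)))
H = (Q0 * np.concatenate([wanted, rest])) @ Q0.T       # symmetric toy Hamiltonian

lb, ub = -0.1, 1.1                                     # bounds enclosing the unwanted part
X = rng.normal(size=(n, n_states))
for _ in range(5):                                     # a few filtered subspace iterations
    X = cheb_filter(H, X, lb, ub, degree=8)
    Q, _ = np.linalg.qr(X)                             # orthonormalize the filtered basis
    w, V = np.linalg.eigh(Q.T @ H @ Q)                 # Rayleigh-Ritz projection
    X = Q @ V

print("max error in the lowest eigenvalues:", np.max(np.abs(w - wanted)))
```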
  • Item
    DEEP LEARNING METHODS FOR MULTI-MODAL HEALTHCARE DATA
    (Georgia Institute of Technology, 2022-05-19) Biswal, Siddharth
    Today, enormous transformations are happening in healthcare research and applications. In the past few years, there has been exponential growth in the amount of healthcare data generated from multiple sources. This growth in data has led to many new possibilities and opportunities for researchers to build models and analytics that improve healthcare for patients. While there has been increasing research on, and successful application of, prediction and classification tasks, many other challenges remain in improving overall healthcare, including optimizing physician performance, reducing healthcare costs, and discovering new treatments for diseases.
    - Doctors often have to perform many time-consuming tasks, which leads to fatigue and misdiagnosis. Many of these tasks could be automated to save time and release doctors from menial work, enabling them to spend more time improving the quality of care.
    - Healthcare datasets contain multiple modalities such as structured sequences, unstructured text, images, and ECG and EEG signals. Successful application of machine learning requires methods that can utilize these diverse data sources.
    - Finally, current healthcare is limited by the treatments available on the market. Many treatments do not make it beyond clinical trials, which leads to many lost opportunities. Machine learning models for different clinical trial-related tasks can improve the outcome of clinical trials and ultimately the quality of treatment for patients.
    In this dissertation, we address these challenges with:
    - Predictive models: building deep learning models for sleep clinics to save the time and effort doctors need for sleep staging and for apnea and limb movement detection.
    - Generative models: developing multimodal deep learning systems that can produce text reports and augment doctors in clinical practice.
    - Interpretable representation models: applying multimodal models to help in clinical trial recruitment and providing counterfactual explanations for clinical trial outcome predictions to improve clinical trial success.
  • Item
    NEWS DATA VISUALIZATION INTERFACE DEVELOPMENT USING NMF ALGORITHM
    (Georgia Institute of Technology, 2022-05-03) Ahn, Byeongsoo
    News data constitute an extremely large-scale dataset. They cover a wide range of topics, from heavy subjects such as politics and society to relatively light ones such as beauty and entertainment. At the same time, news is also the most accessible source of information for the general public. How, then, is this large amount of data actually being utilized by the general public? Currently, the services provided by news platforms are limited to full-article search and related-news recommendation. These use only a fraction of the vast news dataset, and there is still a lack of systems that fully utilize and analyze it. As mentioned above, news datasets, which cover a wide range of topics at a very large scale, record everything that has happened in the past and present, so analyzing and visualizing them can track how real-world trends change over time and can even reveal, through topic modeling, what the topics of the large dataset are without reading the full text. To this end, in this thesis we propose a novel interactive visualization interface for news data based on NMF to analyze, visualize, and utilize such datasets more effectively than simply searching articles. We first show the superior topic modeling performance of the NMF algorithm and the processing speed that makes it suitable for an interactive visual interface compared with other methods, and then present a visual interface containing various features that help users better analyze and intuitively understand the data. Finally, we present use cases showing how this work can be applied in practice and discuss its applicability in various fields.
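    A small scikit-learn sketch of the kind of NMF topic modeling such an interface builds on (a toy corpus, not the thesis implementation): factor the TF-IDF matrix A ≈ WH with W, H >= 0 and read topics off the rows of H.
```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "election results spark debate in parliament",
    "new beauty products trend on social media",
    "government announces policy on election reform",
    "entertainment awards celebrate film and music",
    "parliament votes on new government policy",
    "music stars attend film festival premiere",
]

vectorizer = TfidfVectorizer(stop_words="english")
A = vectorizer.fit_transform(docs)                  # documents x terms matrix
model = NMF(n_components=2, init="nndsvd", random_state=0)
W = model.fit_transform(A)                          # document-topic weights
H = model.components_                               # topic-term weights
terms = vectorizer.get_feature_names_out()

for t, row in enumerate(H):
    top = [terms[i] for i in row.argsort()[::-1][:4]]
    print(f"topic {t}: {', '.join(top)}")
```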
  • Item
    Finding Dense Regions of Rapidly Changing Graphs
    (Georgia Institute of Technology, 2022-05-02) Gabert, Kasimir Georg
    Many of today's massive and rapidly changing graphs contain internal structure---hierarchies of locally dense regions---and finding and tracking this structure is key to detecting emerging behavior, exposing internal activity, summarizing for downstream tasks, identifying important regions, and more. Existing techniques to track these regions fundamentally cannot handle the scale, rate of change, and temporal nature of today's graphs. We identify the crucial missing piece as the need to address the significant variability in graph change rates, algorithm runtimes, temporal behavior, and dense structures themselves. We tackle tracking dense regions in three parts. First, we extend algorithms and theory around dense region computation. We computationally unify nuclei into computing hypergraph cores, providing significant improvements over hand-tuned nuclei algorithms and enabling higher-order nuclei. We develop new batch algorithms for maintaining core hierarchies. We then define new temporal dense regions, called core chains, that build on nuclei hierarchy maintenance and enable effective and powerful dense region tracking. Second, we scale up on shared-memory systems. We provide a parallel input and output library that reduces start-up costs of all known graph systems. We provide the first parallel scalable core and hypergraph core maintenance algorithms, building on the connection between $h$-indices and cores. This addresses computation on rapidly changing graphs during bursty periods with large numbers of graph changes. Third, we address scaling out to support massive graphs. We develop the first parallel $h$-index algorithm, the key kernel for tracking dense regions. We identify that system elasticity is imperative to handle large bursts of changes. We develop a dynamic and elastic graph system, using consistent hashing and sketches, and demonstrate competitive performance against static, inelastic graph systems while enabling new, dynamic applications. By addressing variability directly---in algorithm and system design---we break through previous barriers and bring dense region tracking to massive, rapidly changing graphs.