Organizational Unit:
School of Computational Science and Engineering

Publication Search Results

  • Item
    Scalable Algorithms for Hypergraph Analytics using Symmetric Tensor Decompositions
    (Georgia Institute of Technology, 2023-08-28) Shivakumar, Shruti
    Tensors are higher-dimensional generalizations of matrices and are used to represent multi-dimensional data. Tensor-based methods have received renewed attention in recent years due to their prevalence in diverse real-world applications. Symmetric tensors are an important class of tensors, arising in fields such as signal processing, machine learning, and hypergraph analytics. Hypergraphs, generalizations of graphs that allow edges to span multiple vertices, have become ubiquitous in modeling real-world networks and multi-entity interactions. Affinity relations in a hypergraph can be represented as a high-order adjacency tensor that is sparse and symmetric. While mathematical research on symmetric tensors is longstanding, the emergence of massive data in these applications has sparked demand for scalable, efficient algorithms that draw on advances in numerical linear algebra, numerical optimization, and high-performance computing. State-of-the-art tensor libraries incorporate high-performance methods for general sparse tensors; however, they lack specialized algorithms for sparse tensors that are symmetric. This dissertation focuses on scaling hypergraph analytics to real-world datasets by exploiting the sparsity and symmetry of the associated adjacency tensors through the development of compact storage formats and efficient serial and parallel algorithms for tensor operations. We present a novel computation-aware compressed storage format, CSS, for sparse symmetric tensors, along with efficient parallel algorithms for symmetric tensor operations that are compute- and memory-intensive due to the high tensor order and the associated factorial explosion in the number of non-zeros. In order to scale to large multi-entity complex networks, we consider the problem of distributed-memory hypergraph analytics. To that end, we present algorithms for parallel distributed-memory line graph construction of hypergraphs and demonstrate their application to large-scale symmetric adjacency tensor decomposition for hypergraph clustering. For hypergraphs with varying edge cardinalities, we extend the CSS format to the CCSS format and use it in a new shared-memory parallel algorithm for a key symmetric tensor kernel in the computation of hypergraph tensor eigenvector centrality. Finally, we present Coupled Symmetric Tensor Completion (CoSTCo), a Riemannian optimization framework for link prediction in non-uniform hypergraphs, and analyze its performance on both synthetic and real-world datasets against state-of-the-art general tensor completion algorithms.
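    The CSS format itself is specific to this dissertation, but the core idea it builds on, storing each nonzero of a symmetric sparse tensor only once under a canonical ordering of its indices, can be shown in a short sketch. The Python class below is a hypothetical, minimal illustration of that idea only; it is not the CSS format or code from the thesis.

      # Minimal sketch: store each nonzero of a sparse symmetric tensor once,
      # keyed by its canonically sorted index tuple (illustrative only).
      from collections import defaultdict

      class SymmetricSparseTensor:
          def __init__(self, order):
              self.order = order
              self.vals = defaultdict(float)

          def add(self, idx, v):
              assert len(idx) == self.order
              self.vals[tuple(sorted(idx))] += v   # all index permutations share one entry

          def get(self, idx):
              return self.vals.get(tuple(sorted(idx)), 0.0)

      # Usage: a 3rd-order adjacency tensor of a 3-uniform hypergraph.
      T = SymmetricSparseTensor(order=3)
      T.add((2, 0, 1), 1.0)          # hyperedge {0, 1, 2}
      print(T.get((1, 2, 0)))        # 1.0 -- any permutation reaches the same nonzero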
  • Item
    Efficient methods for read mapping
    (Georgia Institute of Technology, 2022-08-01) Zhang, Haowen
    DNA sequencing is the mainstay of biological and medical research. Modern sequencing machines can read millions of DNA fragments, sampling the underlying genomes at high throughput. Mapping the resulting reads to a reference genome is typically the first step in sequencing data analysis. The problem has many variants, as the reads can be short or long with a low or high error rate depending on the sequencing technology, and the reference can be a single genome or a graph representation of multiple genomes. Therefore, it is crucial to develop efficient computational methods for these different problem classes. Moreover, continually declining sequencing costs and increasing throughput pose challenges to previously developed methods and tools that cannot handle the growing volume of sequencing data. This dissertation seeks to advance the state of the art in the established field of read mapping by proposing more efficient and scalable read mapping methods as well as tackling emerging problem areas. Specifically, we design ultra-fast methods to map two types of reads: short reads for high-throughput chromatin profiling and nanopore raw reads for targeted sequencing in real time. In tune with the characteristics of these types of reads, our methods can scale to larger sequencing data sets or map more reads correctly compared with state-of-the-art mapping software. Furthermore, we propose two algorithms for aligning sequences to graphs, which is the foundation of mapping reads to graph-based reference genomes. One algorithm improves the time complexity of existing sequence-to-graph alignment algorithms for linear or affine gap penalties. The other algorithm provides good empirical performance under the edit distance metric. Finally, we mathematically formulate the problem of validating paired-end read constraints when mapping sequences to graphs, and propose an exact algorithm that is also fast enough for practical use.
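    As context for the graph-alignment results above, the sketch below shows the classic dynamic program for edit distance between a read and a linear reference, i.e., the base case that sequence-to-graph alignment generalizes. It is an illustrative sketch only, not code or an algorithm from the dissertation.

      # Minimal sketch: edit distance between a read and a linear reference,
      # computed with a single rolling row of the dynamic-programming table.
      def edit_distance(read, ref):
          m, n = len(read), len(ref)
          dp = list(range(n + 1))                 # row for the empty read prefix
          for i in range(1, m + 1):
              prev_diag, dp[0] = dp[0], i
              for j in range(1, n + 1):
                  cur = dp[j]
                  cost = 0 if read[i - 1] == ref[j - 1] else 1
                  dp[j] = min(dp[j] + 1,          # consume a read base only (gap in reference)
                              dp[j - 1] + 1,      # consume a reference base only (gap in read)
                              prev_diag + cost)   # match or mismatch
                  prev_diag = cur
          return dp[n]

      print(edit_distance("GATTACA", "GCATGCA"))  # 3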
  • Item
    Parallel Algorithms and Generalized Frameworks for Learning Large-Scale Bayesian Networks
    (Georgia Institute of Technology, 2021-08-13) Srivastava, Ankit
    Bayesian networks (BNs) are an important subclass of probabilistic graphical models that employ directed acyclic graphs to compactly represent exponential-sized joint probability distributions over a set of random variables. Since BNs enable probabilistic reasoning about interactions between the variables of interest, they have been successfully applied in a wide range of areas, including medical diagnosis, gene networks, cybersecurity, and epidemiology. Furthermore, the recent focus on the need for explainability in human-impact decisions made by machine learning (ML) models has led to a push for replacing prevalent black-box models with inherently interpretable models like BNs for making high-stakes decisions in hitherto unexplored areas. Learning the exact structure of BNs from observational data is an NP-hard problem, and therefore a wide range of heuristic algorithms have been developed for this purpose. However, even the heuristic algorithms are computationally intensive. The existing software packages for BN structure learning with implementations of multiple algorithms are either completely sequential or support limited parallelism, and can take days to learn BNs with even a few thousand variables. Previous parallelization efforts have focused on one or two algorithms for specific applications and have not resulted in broadly applicable parallel software. This has prevented BNs from becoming a viable alternative to other ML models. In this dissertation, we develop efficient parallel versions of a variety of BN learning algorithms from two categories: six different constraint-based methods and a score-based method for constructing a specialization of BNs known as module networks. We also propose optimizations for the implementations of these parallel algorithms to achieve maximum performance in practice. Our proposed algorithms are scalable to thousands of cores and outperform the previous state of the art by a large margin. We have made the implementations available as open-source software packages that can be used by ML and application-domain researchers for expeditious learning of large-scale BNs.
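    For readers unfamiliar with constraint-based structure learning, the sketch below shows the sequential skeleton phase of a PC-style algorithm, which repeatedly removes edges between conditionally independent variables. It is a generic illustration, not one of the dissertation's parallel algorithms; ci_test is an assumed, user-supplied conditional-independence test.

      # Minimal sequential sketch of the skeleton phase of a PC-style constraint-based
      # learner; ci_test(x, y, cond_set, data) is an assumed, user-supplied CI test.
      from itertools import combinations

      def pc_skeleton(variables, data, ci_test, max_cond=3):
          adj = {v: set(variables) - {v} for v in variables}    # start fully connected
          for level in range(max_cond + 1):                     # conditioning-set size
              for x in variables:
                  for y in list(adj[x]):
                      others = adj[x] - {y}
                      if len(others) < level:
                          continue
                      for s in combinations(others, level):
                          if ci_test(x, y, s, data):            # X independent of Y given S?
                              adj[x].discard(y)
                              adj[y].discard(x)
                              break
          return {(x, y) for x in adj for y in adj[x] if x < y}  # undirected skeleton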
  • Item
    On Using Inductive Biases for Designing Deep Learning Architectures
    (Georgia Institute of Technology, 2020-12-15) Shrivastava, Harsh
    Recent advancements in the field of Artificial Intelligence, especially in Deep Learning (DL), have paved the way for new and improved solutions to complex problems arising in almost all domains. Often we have prior knowledge and beliefs about the underlying system of the problem at hand that we want to capture in the corresponding deep learning architectures. Sometimes it is not clear how to include those prior beliefs in the traditionally recommended deep architectures such as recurrent neural networks, convolutional neural networks, and variational autoencoders. Often the post-hoc techniques for modifying these architectures are not straightforward and provide little performance gain. There have been efforts to develop domain-specific architectures, but those techniques are generally not transferable to other domains. We ask whether we can devise generic and intuitive techniques for designing deep learning architectures that take our prior knowledge of the system as an inductive bias. In this dissertation, we develop two novel approaches towards this end. The first, called `Cooperative Neural Networks', can incorporate the inductive bias from the underlying probabilistic graphical model representation of the domain. The second, problem-dependent `Unrolled Algorithms', parameterizes the recurrent structure obtained by unrolling the iterations of an optimization algorithm for the objective function defining the problem. We found that the neural network architectures obtained from our approaches typically end up with far fewer learnable parameters and provide considerable run-time improvements compared to other deep learning methods. We have successfully applied our techniques to natural language processing tasks, sparse graph recovery, and computational biology problems such as gene regulatory network inference. Firstly, we introduce the Cooperative Neural Networks approach, a new theoretical approach for implementing learning systems that can exploit both prior insights about the independence structure of the problem domain and the universal approximation capability of deep neural networks. Specifically, we develop the CoNN-sLDA model for the document classification task, using the popular Latent Dirichlet Allocation graphical model as the inductive bias for CoNN-sLDA. We demonstrate a 23% reduction in error on the challenging MultiSent data set compared to the state of the art and also derive ways to make the learned representations more interpretable. Secondly, we elucidate the idea of using problem-dependent `Unrolled Algorithms' for the sparse graph recovery task. We propose a deep learning architecture, GLAD, which uses an Alternating Minimization algorithm as its inductive bias and learns the model parameters via supervised learning. We show that GLAD learns a very compact and effective model for recovering sparse graphs from data. We provide an extensive theoretical analysis that strengthens the case for using similar approaches for other problems as well. Finally, we build on the proposed `Unrolled Algorithm' technique for a challenging real-world computational biology problem. To this end, we design GRNUlar, a novel deep learning framework for supervised learning of gene regulatory networks (GRNs) from single-cell RNA-sequencing data. Our framework incorporates two intertwined models. We first leverage the expressive ability of neural networks to capture complex dependencies between transcription factors and the genes they regulate by developing a multi-task learning framework. Then, to capture the sparsity of GRNs observed in the real world, we design an unrolled algorithm technique for our framework. Our deep architecture requires supervision for training, for which we repurpose existing synthetic data simulators that generate scRNA-Seq data guided by an underlying GRN. Experimental results demonstrate that GRNUlar outperforms state-of-the-art methods on both synthetic and real datasets. Our work also demonstrates the novel and successful use of expression data simulators for supervised learning of GRN inference.
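    The `Unrolled Algorithms' idea above can be summarized in a small sketch: a fixed number of iterations of a classical optimizer become the layers of a network, and the per-iteration constants become learnable parameters. The PyTorch snippet below unrolls soft-thresholded gradient steps for sparse recovery; it is a generic, hypothetical illustration of the pattern, not GLAD's or GRNUlar's actual architecture.

      # Minimal sketch of an unrolled algorithm: K iterations of ISTA-style
      # soft-thresholded gradient steps, with learnable step sizes and thresholds.
      import torch
      import torch.nn as nn

      def soft_threshold(x, lam):
          return torch.sign(x) * torch.relu(torch.abs(x) - lam)

      class UnrolledSparseSolver(nn.Module):
          def __init__(self, num_layers=5):
              super().__init__()
              self.steps = nn.Parameter(torch.full((num_layers,), 0.1))        # learned step sizes
              self.thresholds = nn.Parameter(torch.full((num_layers,), 0.05))  # learned thresholds

          def forward(self, A, y):
              x = torch.zeros(A.shape[1])
              for k in range(self.steps.shape[0]):       # one "layer" per unrolled iteration
                  grad = A.T @ (A @ x - y)               # gradient of 0.5 * ||Ax - y||^2
                  x = soft_threshold(x - self.steps[k] * grad, self.thresholds[k])
              return x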
  • Item
    Long read mapping at scale: Algorithms and applications
    (Georgia Institute of Technology, 2019-04-01) Jain, Chirag
    The capability to sequence DNA has been around for four decades now, providing ample time to explore its myriad applications and the concomitant development of bioinformatics methods to support them. Nevertheless, disruptive technological changes in sequencing often upend prevailing protocols and the characteristics of what can be sequenced, necessitating a new direction of development for bioinformatics algorithms and software. We are now at the cusp of the next revolution in sequencing due to the development of long and ultra-long read sequencing technologies by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Long reads are attractive because they narrow the scale gap between the sizes of genomes and the sizes of sequenced reads, with the promise of avoiding the assembly errors and repeat-resolution challenges that plague short-read assemblers. However, long reads themselves sport error rates in the vicinity of 10-15%, compared to the high accuracy of short reads (< 1%). There is an urgent need to develop bioinformatics methods to fully realize the potential of long-read sequencers. Mapping and alignment of reads to a reference is typically the first step in genomics applications. Though long-read technologies are still evolving, research efforts in bioinformatics have already produced many alignment-based and alignment-free read mapping algorithms. Yet much work lies ahead in designing provably efficient algorithms, formally characterizing the quality of results, and developing methods that scale to larger input datasets and growing reference databases. While the current model of representing the reference as a collection of linear genomes is still favored due to its simplicity, mapping to graph-based representations, where the graph encodes genetic variations in a human population, is also becoming imperative. This dissertation is focused on provably good and scalable algorithms for mapping long reads to both linear and graph references. We make the following contributions: 1. We develop fast and approximate algorithms for end-to-end and split mapping of long reads to reference genomes. Our work is the first to demonstrate scaling to the entire NCBI database, the collection of all curated and non-redundant genomes. 2. We generalize the mapping algorithm to accelerate the related problem of computing pairwise whole-genome comparisons. We shed light on two fundamental biological questions concerning genomic duplications and the delineation of microbial species boundaries. 3. We provide new complexity results for aligning reads to graphs under Hamming and edit distance models to classify the problem variants for which the existence of a polynomial-time solution is unlikely. In contrast to prior results that assume alphabets whose size is a function of the problem size, we prove that the problem variants that allow edits in the graph remain NP-complete even for constant-sized alphabets, thereby resolving the computational complexity of the problem for DNA and protein sequence-to-graph alignments. 4. Finally, we propose a new parallel algorithm to optimally align long reads to large variation graphs derived from human genomes. It demonstrates near-linear scaling on multi-core CPUs, reducing run-time from multiple days to three hours when aligning a long read set to an MHC human variation graph.
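    Many alignment-free long-read mappers build on sketching primitives such as window minimizers; the code below illustrates that primitive only and is not the mapping algorithm developed in the dissertation. The k-mer size and window length are arbitrary illustrative choices.

      # Minimal sketch: window minimizers of a sequence; reads and references that share
      # many minimizer hashes are candidate mapping locations.
      def minimizers(seq, k=15, w=10):
          hashes = [hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
          picked = set()
          for start in range(len(hashes) - w + 1):
              window = hashes[start:start + w]
              j = min(range(w), key=lambda t: window[t])   # smallest hash in the window
              picked.add((start + j, window[j]))
          return picked

      ref = "ACGTTGCAAGGCTTAACCGGTTACGATCGATCG" * 10
      read = ref[100:220]                                  # a read drawn from the reference
      ref_hashes = {h for _, h in minimizers(ref)}
      read_hashes = {h for _, h in minimizers(read)}
      print(len(read_hashes & ref_hashes), "shared minimizer hashes")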
  • Item
    Parallel and scalable combinatorial string algorithms on distributed memory systems
    (Georgia Institute of Technology, 2019-03-29) Flick, Patrick
    Methods for processing and analyzing DNA and genomic data are built upon combinatorial graph and string algorithms. The advent of high-throughput DNA sequencing is enabling the generation of billions of reads per experiment. Classical, sequential algorithms can no longer cope with these growing data sizes, which for the last 10 years have greatly outpaced advances in processor speeds. Processing and analyzing state-of-the-art genomic data sets require the design of scalable and efficient parallel algorithms and the use of large computing clusters. Suffix arrays and suffix trees are fundamental string data structures that lie at the foundation of many string algorithms, with important applications in text processing, information retrieval, and computational biology. Consequently, the parallel construction of these indices is an actively studied problem. However, prior approaches lacked good worst-case run-time guarantees and exhibited poor scaling and overall performance. In this work, we present our distributed-memory parallel algorithms for indexing large datasets, including algorithms for the distributed construction of suffix arrays, LCP arrays, and suffix trees. We formulate a generalized version of the All-Nearest-Smaller-Values problem, provide an optimal distributed solution, and apply it to the distributed construction of suffix trees, yielding a work-optimal parallel algorithm. Our algorithms for distributed suffix array and suffix tree construction improve on the state of the art by simultaneously improving worst-case run-time bounds and achieving superior practical performance. Next, we introduce a novel distributed string index, the Distributed Enhanced Suffix Array (DESA), which builds on the suffix and LCP arrays and consists of these and additional distributed data structures. The DESA is designed to allow efficient pattern search queries in distributed memory while requiring at most O(n/p) memory per process. We present efficient distributed-memory parallel algorithms for querying, as well as for the efficient construction of this distributed index. Finally, we present our work on distributed-memory algorithms for clustering de Bruijn graphs and its application to solving a grand challenge metagenomic dataset.
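    The All-Nearest-Smaller-Values problem mentioned above has a simple sequential stack-based solution; the sketch below shows only that sequential version, as a point of reference for the generalized, distributed formulation in the dissertation.

      # Minimal sequential sketch of All-Nearest-Smaller-Values (left direction):
      # for each i, find the nearest j < i with a[j] < a[i], or -1 if none exists.
      def nearest_smaller_to_left(a):
          result, stack = [], []                 # stack keeps indices of increasing values
          for i, v in enumerate(a):
              while stack and a[stack[-1]] >= v:
                  stack.pop()
              result.append(stack[-1] if stack else -1)
              stack.append(i)
          return result

      lcp = [0, 2, 5, 3, 1, 4]
      print(nearest_smaller_to_left(lcp))        # [-1, 0, 1, 1, 0, 4]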
  • Item
    Techniques to improve genome assembly quality
    (Georgia Institute of Technology, 2019-03-28) Nihalani, Rahul
    De novo genome assembly is an important problem in the field of genomics. Discovering and analyzing the genomes of different species has numerous applications. For humans, it can lead to early detection of disease traits and timely prevention of diseases like cancer. In addition, it is useful in discovering the genomes of unknown species. Even though it has received enormous attention in the last couple of decades, the problem remains unsolved to a satisfactory level, as shown in various scientific studies. Paired-end sequencing is a technology that sequences pairs of short strands from a genome, called reads. The pairs of reads originate from nearby genomic locations and are commonly used to help determine the genomic location of individual reads more accurately and to resolve repeats in genome assembly. In this thesis, we describe the genome assembly problem and the key challenges involved in solving it. We discuss related work, describing the two most popular models used to approach the problem, de Bruijn graphs and overlap graphs, along with their pros and cons. We then describe our proposed techniques to improve the quality of genome assembly. Our main contribution in this work is the design of a de Bruijn graph based assembly algorithm that effectively utilizes paired reads to improve genome assembly quality. We also discuss how our algorithm tackles some of the key challenges involved in genome assembly. We adapt this algorithm to design a parallel strategy that obtains high-quality assemblies for large datasets, such as rice, within a reasonable time frame. In addition, we describe our work on probabilistically estimating overlap graphs for large short-read datasets. We discuss the results obtained for our work and conclude with some future work.
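    The de Bruijn graph model discussed above can be illustrated in a few lines of code: nodes are (k-1)-mers and every k-mer in a read contributes one edge. The sketch is illustrative only and omits everything that makes the thesis's paired-read algorithm work (reverse complements, coverage, error handling, parallelism).

      # Minimal sketch: build a de Bruijn graph from reads; each k-mer adds an edge
      # from its prefix (k-1)-mer to its suffix (k-1)-mer.
      from collections import defaultdict

      def de_bruijn(reads, k=4):
          graph = defaultdict(set)
          for read in reads:
              for i in range(len(read) - k + 1):
                  kmer = read[i:i + k]
                  graph[kmer[:-1]].add(kmer[1:])
          return graph

      g = de_bruijn(["ACGTAC", "CGTACG"], k=4)
      for node, succs in sorted(g.items()):
          print(node, "->", sorted(succs))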
  • Item
    Distributed memory building blocks for massive biological sequence analysis
    (Georgia Institute of Technology, 2018-04-03) Pan, Tony C.
    K-mer indices and de Bruijn graphs are important data structures in bioinformatics, with applications ranging from foundational tasks such as error correction, alignment, and genome assembly to knowledge discovery tasks including repeat detection and SNP identification. While advances in next-generation sequencing technologies have dramatically reduced cost and improved latency and throughput, few bioinformatics tools can efficiently process data sets at the current generation rate of 1.8 terabases every 3 days. The volume and velocity with which sequencing data is generated necessitate efficient algorithms and implementations of k-mer indices and de Bruijn graphs, two central components in bioinformatics applications. Existing applications that utilize k-mer counting and de Bruijn graphs, however, tend to provide embedded, specialized implementations. The research presented here represents efforts toward the creation of the first reusable, flexible, and extensible distributed-memory parallel libraries for k-mer indexing and de Bruijn graphs. These libraries are intended to simplify the development of bioinformatics applications for distributed-memory environments. For each library, our goals are to create a set of APIs that are simple to use and to provide optimized implementations based on efficient parallel algorithms. We designed algorithms that minimize communication volume and latency, and developed implementations with better cache utilization and SIMD vectorization. We developed Kmerind, a k-mer counting and indexing library based on distributed-memory hash tables and distributed sorted arrays, that provides efficient insert, find, count, and erase operations. For de Bruijn graphs, we developed Bruno by leveraging Kmerind functionality to support parallel de Bruijn graph construction, chain compaction, error removal, and graph traversal and element query. Our performance evaluations showed that Kmerind is scalable and high-performance: Kmerind counted the k-mers in a 120 GB data set in less than 13 seconds on 1024 cores, and indexed the k-mer positions in 17 seconds. Using the Cori supercomputer and incorporating architecture-aware optimizations as well as MPI-OpenMP hybrid computation and overlapped communication, Kmerind was able to count a 350 GB data set in 4.1 seconds using 4096 cores. Kmerind has been shown to outperform state-of-the-art k-mer counting tools at 32 to 64 cores on a shared-memory system. The Bruno library is built on Kmerind and implements efficient algorithms for construction, compaction, and error removal. It is capable of constructing, compacting, and generating unitigs for a 694 GB human read data set in 7.3 seconds on 7680 Edison cores. It is 1.4X and 3.7X faster than its state-of-the-art alternatives in shared- and distributed-memory environments, respectively. Error removal in a graph constructed from a 162 GB data set completed in 13.1 and 3.91 seconds with frequency filters of 2 and 4, respectively, on 16 nodes totaling 512 cores. While our target domain is bioinformatics, we approached algorithm design and implementation with the aim of broader applicability in computer science and other application domains. As a result, our chain compaction and cycle detection algorithms can feasibly be applied to general graphs, and our distributed and sequential cache-friendly hash tables, as well as our vectorized hash functions, are generic and application-neutral.
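    The central idea behind a distributed k-mer index, assigning each k-mer to an owning process by hashing it, can be shown in a single-machine sketch. The code below only simulates the partitioning with per-process counters; Kmerind's actual implementation uses MPI communication and many further optimizations.

      # Minimal single-machine sketch of hash-partitioned k-mer counting: each k-mer
      # is routed to the "process" chosen by its hash, then counted locally.
      from collections import Counter

      def count_kmers_partitioned(reads, k, num_procs):
          local_counts = [Counter() for _ in range(num_procs)]
          for read in reads:
              for i in range(len(read) - k + 1):
                  kmer = read[i:i + k]
                  owner = hash(kmer) % num_procs     # destination process for this k-mer
                  local_counts[owner][kmer] += 1     # in MPI this would be an all-to-all exchange
          return local_counts

      parts = count_kmers_partitioned(["ACGTACGT", "GTACGTAA"], k=3, num_procs=4)
      total = sum(parts, Counter())                  # merge per-process counts for inspection
      print(total.most_common(3))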
  • Item
    Algorithmic techniques for the micron automata processor
    (Georgia Institute of Technology, 2015-05-15) Roy, Indranil
    Our research is the first in-depth study of the use of the Micron Automata Processor, a novel reconfigurable streaming co-processor that is purpose-built to execute thousands of Non-deterministic Finite Automata (NFA) in parallel. By design, this processor is well suited to accelerating applications that need to find all occurrences of thousands of complex string patterns in the input data. We have validated this by implementing two such applications, one from network security and the other from bioinformatics, both of which are significantly faster than their state-of-the-art counterparts. Our research has also widened the scope of the applications that can be accelerated by this processor by finding ways to quickly program any generic graph into it and then search for hard-to-find features such as maximal cliques and Hamiltonian paths. These applications and algorithms have yielded valuable design inputs for the next generation of the chip, which is currently in the design phase. We hope that this work paves the way for the early adoption of this upcoming architecture and for the efficient solution of some currently computationally challenging problems.
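    The Automata Processor executes thousands of NFAs against an input stream in hardware; the sketch below shows, in software, the set-of-active-states simulation that corresponds to multi-pattern literal matching. It is an illustrative sketch only, not the chip's programming model or any algorithm from the dissertation.

      # Minimal sketch: advance a set of active (pattern, offset) states one input
      # symbol at a time, reporting every completed pattern occurrence.
      def find_patterns(text, patterns):
          matches, active = [], set()
          for pos, ch in enumerate(text):
              next_active = set()
              for p in patterns:                     # a pattern may start at any symbol
                  if p[0] == ch:
                      next_active.add((p, 1))
              for p, off in active:                  # extend existing partial matches
                  if p[off] == ch:
                      next_active.add((p, off + 1))
              active = set()
              for p, off in next_active:
                  if off == len(p):
                      matches.append((pos, p))       # report match ending at pos
                  else:
                      active.add((p, off))
          return matches

      print(find_patterns("abracadabra", ["abra", "cad", "ra"]))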