Organizational Unit:
School of Computational Science and Engineering

Research Organization Registry ID
Description
Previous Names
Parent Organization
Parent Organization
Organizational Unit
Includes Organization(s)

Publication Search Results

Now showing 1 - 4 of 4
  • Item
    Efficient parallel algorithms for error correction and transcriptome assembly of biological sequences
    (Georgia Institute of Technology, 2018-05-29) Sachdeva, Vipin
    Next-generation sequencing technologies have led to a big data age in biology. Since the sequencing of the human genome, the primary bottleneck has steadily moved from collection to storage and analysis of the data. The primary contributions of this dissertation are design and implementation of novel parallel algorithms for two important problems in bioinformatics – error-correction and transcriptome assembly. For error-correction, we focused on k-mer spectrum based error-correction application called Reptile. We designed a novel distributed memory algorithm that divided the k-mer and tiles amongst the processing ranks. This allows any hardware with any memory size per node to be employed for error-correction using Reptile’s algorithm, irrespective of the size of the dataset. Our implementational achieved highly scalable results for E.Coli, Drosophila as well as the human datasets which consisted of 1.55 billion reads. Besides an algorithm that distributes k-mers and tiles between ranks, we have also implemented numerous heuristics that are useful to adjust the algorithm based on the hardware traits. We also implemented an extension of our parallel algorithm further by using pre-generating tiles and using collective messages to reduce the number of point to point messages for error-correction. Further extensions of this work have focused to create a library for distributed k-mer processing which has applications to problems in metagenomics. For transcriptome assembly, we have implemented a hybrid MPI-OpenMP approach for Chrysalis, which is part of the Trinity pipeline. Chrysalis clusters minimally overlapping contigs obtained from the prior module in Trinity called Inchworm. With this parallelization, we were able to reduce the runtime of the Chrysalis step of the Trinity workflow from over 50 hours to less than 5 hours for the sugarbeet dataset. We also employed this implementation to complete transcriptome of a 1.5 billion reads dataset pooled from different bread wheat cultivars. Furthermore, we have also implemented a MapReduce based approach to clustering k-mers which has application to the parallelization of the Inchworm module of Trinity. This implementation is a significant step towards making de novo transcriptome assembly feasible for ever bigger transcriptome datasets.
  • Item
    A Cache-Aware Parallel Implementation of the Push-Relabel Network Flow Algorithm and Experimental Evaluation of the Gap Relabeling Heuristic
    (Georgia Institute of Technology, 2006-02-25) Bader, David A. ; Sachdeva, Vipin
    The maximum flow problem is a combinatorial problem of significant importance in a wide variety of research and commercial applications. It has been extensively studied and implemented over the past 40 years. The push-relabel method has been shown to be superior to other methods, both in theoretical bounds and in experimental implementations. Our study discusses the implementation of the push-relabel network flow algorithm on present-day symmetric multiprocessors (SMP's) with large shared memories. The maximum flow problem is an irregular graph problem and requires frequent fine-grained locking of edges and vertices. Over a decade ago, Anderson and Setubal implemented Goldberg's push-relabel algorithm for shared memory parallel computers; however, modern systems differ significantly from those targeted by their implementation in that SMP's today have deep memory hierarchies and different performance costs for synchronization and fine-grained locking. Besides our new cache-aware implementation of Goldberg's parallel algorithm for modern shared-memory parallel computers, our main new contribution is the first parallel implementation and analysis of the gap relabeling heuristic that runs from 2.1 to 4.3 times faster for sparse graphs.
  • Item
    BioPerf: A Benchmark Suite to Evaluate High-Performance Computer Architecture on Bioinformatics Applications
    (Georgia Institute of Technology, 2006) Bader, David A. ; Li, Yue ; Li, Tao ; Sachdeva, Vipin
    The exponential growth in the amount of genomic data has spurred growing interest in large scale analysis of genetic information. Bioinformatics applications, which explore computational methods to allow researchers to sift through the massive biological data and extract useful information, are becoming increasingly important computer workloads. This paper presents BioPerf, a benchmark suite of representative bioinformatics applications to facilitate the design and evaluation of high-performance computer architectures for these emerging workloads. Currently, the BioPerf suite contains codes from 10 highly popular bioinformatics packages and covers the major fields of study in computational biology such as sequence comparison, phylogenetic reconstruction, protein structure prediction, and sequence homology & gene finding. We demonstrate the use of BioPerf by providing simulation points of pre-compiled Alpha binaries and with a performance study on IBM Power using IBM Mambo simulations cross-compared with Apple G5 executions. The BioPerf suite (available from www.bioperf.org) includes benchmark source code, input datasets of various sizes, and information for compiling and using the benchmarks. Our benchmark suite includes parallel codes where available.
  • Item
    An Open Benchmark Suite for Evaluating Computer Architecture on Bioinformatics and Life Science Applications
    (Georgia Institute of Technology, 2006) Bader, David A. ; Sachdeva, Vipin
    In this paper, we propose BIOPERF, a definitive benchmark suite of representative applications from the biology and life sciences community, where the codes are carefully selected to span a breadth of algorithms and performance characteristics. The BIOPERF suite is available from www.bioperf. org and includes benchmark source code, input datasets of various sizes, and information for compiling and using the benchmarks. We include parallel codes where available.