Organizational Unit:

School of Computational Science and Engineering

Permanent Link

https://hdl.handle.net/1853/70780

Parent Organization

Organizational Unit

College of Computing

ArchiveSpace Name Record

https://finding-aids.library.gatech.edu/agents/corporate_entities/1111

Full item page

Publication Search Results

Now showing 1 - 4 of 4

Efficient parallel algorithms for error correction and transcriptome assembly of biological sequences

(Georgia Institute of Technology, 2018-05-29) Sachdeva, Vipin

Next-generation sequencing technologies have led to a big data age in biology. Since the sequencing of the human genome, the primary bottleneck has steadily moved from collection to storage and analysis of the data. The primary contributions of this dissertation are design and implementation of novel parallel algorithms for two important problems in bioinformatics – error-correction and transcriptome assembly. For error-correction, we focused on k-mer spectrum based error-correction application called Reptile. We designed a novel distributed memory algorithm that divided the k-mer and tiles amongst the processing ranks. This allows any hardware with any memory size per node to be employed for error-correction using Reptile’s algorithm, irrespective of the size of the dataset. Our implementational achieved highly scalable results for E.Coli, Drosophila as well as the human datasets which consisted of 1.55 billion reads. Besides an algorithm that distributes k-mers and tiles between ranks, we have also implemented numerous heuristics that are useful to adjust the algorithm based on the hardware traits. We also implemented an extension of our parallel algorithm further by using pre-generating tiles and using collective messages to reduce the number of point to point messages for error-correction. Further extensions of this work have focused to create a library for distributed k-mer processing which has applications to problems in metagenomics. For transcriptome assembly, we have implemented a hybrid MPI-OpenMP approach for Chrysalis, which is part of the Trinity pipeline. Chrysalis clusters minimally overlapping contigs obtained from the prior module in Trinity called Inchworm. With this parallelization, we were able to reduce the runtime of the Chrysalis step of the Trinity workflow from over 50 hours to less than 5 hours for the sugarbeet dataset. We also employed this implementation to complete transcriptome of a 1.5 billion reads dataset pooled from different bread wheat cultivars. Furthermore, we have also implemented a MapReduce based approach to clustering k-mers which has application to the parallelization of the Inchworm module of Trinity. This implementation is a significant step towards making de novo transcriptome assembly feasible for ever bigger transcriptome datasets.
A Cache-Aware Parallel Implementation of the Push-Relabel Network Flow Algorithm and Experimental Evaluation of the Gap Relabeling Heuristic

(Georgia Institute of Technology, 2006-02-25) Bader, David A. ; Sachdeva, Vipin

The maximum flow problem is a combinatorial problem of significant importance in a wide variety of research and commercial applications. It has been extensively studied and implemented over the past 40 years. The push-relabel method has been shown to be superior to other methods, both in theoretical bounds and in experimental implementations. Our study discusses the implementation of the push-relabel network flow algorithm on present-day symmetric multiprocessors (SMP's) with large shared memories. The maximum flow problem is an irregular graph problem and requires frequent fine-grained locking of edges and vertices. Over a decade ago, Anderson and Setubal implemented Goldberg's push-relabel algorithm for shared memory parallel computers; however, modern systems differ significantly from those targeted by their implementation in that SMP's today have deep memory hierarchies and different performance costs for synchronization and fine-grained locking. Besides our new cache-aware implementation of Goldberg's parallel algorithm for modern shared-memory parallel computers, our main new contribution is the first parallel implementation and analysis of the gap relabeling heuristic that runs from 2.1 to 4.3 times faster for sparse graphs.
BioPerf: A Benchmark Suite to Evaluate High-Performance Computer Architecture on Bioinformatics Applications

(Georgia Institute of Technology, 2006) Bader, David A. ; Li, Yue ; Li, Tao ; Sachdeva, Vipin

The exponential growth in the amount of genomic data has spurred growing interest in large scale analysis of genetic information. Bioinformatics applications, which explore computational methods to allow researchers to sift through the massive biological data and extract useful information, are becoming increasingly important computer workloads. This paper presents BioPerf, a benchmark suite of representative bioinformatics applications to facilitate the design and evaluation of high-performance computer architectures for these emerging workloads. Currently, the BioPerf suite contains codes from 10 highly popular bioinformatics packages and covers the major fields of study in computational biology such as sequence comparison, phylogenetic reconstruction, protein structure prediction, and sequence homology & gene finding. We demonstrate the use of BioPerf by providing simulation points of pre-compiled Alpha binaries and with a performance study on IBM Power using IBM Mambo simulations cross-compared with Apple G5 executions. The BioPerf suite (available from www.bioperf.org) includes benchmark source code, input datasets of various sizes, and information for compiling and using the benchmarks. Our benchmark suite includes parallel codes where available.
An Open Benchmark Suite for Evaluating Computer Architecture on Bioinformatics and Life Science Applications

(Georgia Institute of Technology, 2006) Bader, David A. ; Sachdeva, Vipin

In this paper, we propose BIOPERF, a definitive benchmark suite of representative applications from the biology and life sciences community, where the codes are carefully selected to span a breadth of algorithms and performance characteristics. The BIOPERF suite is available from www.bioperf. org and includes benchmark source code, input datasets of various sizes, and information for compiling and using the benchmarks. We include parallel codes where available.

Organizational Unit:

School of Computational Science and Engineering

Permanent Link

Research Organization Registry ID

Description

Previous Names

Parent Organization

Parent Organization

Includes Organization(s)

ArchiveSpace Name Record

Filters

Author

Advisor

Date

Organization

Series

Resource Type

Resource Subtype

Has files

Record Type

Settings

Sort By

Results per page

Publication Search Results

Georgia Tech Library

Organizational Unit: School of Computational Science and Engineering

Permanent Link

Research Organization Registry ID

Description

Previous Names

Parent Organization

Parent Organization

Includes Organization(s)

ArchiveSpace Name Record

Filters

Author

Advisor

Date

Organization

Series

Resource Type

Resource Subtype

Has files

Record Type

Settings

Sort By

Results per page

Publication Search Results

Organizational Unit:

School of Computational Science and Engineering