Title:
Distributed memory building blocks for massive biological sequence analysis

dc.contributor.advisor Bader, David A.
dc.contributor.advisor Aluru, Srinivas
dc.contributor.author Pan, Tony C.
dc.contributor.committeeMember Catalyurek, Umit
dc.contributor.committeeMember Vuduc, Richard
dc.contributor.committeeMember Jordan, King
dc.contributor.committeeMember Vannberg, Fredrick
dc.contributor.department Computational Science and Engineering
dc.date.accessioned 2018-05-31T18:15:19Z
dc.date.available 2018-05-31T18:15:19Z
dc.date.created 2018-05
dc.date.issued 2018-04-03
dc.date.submitted May 2018
dc.date.updated 2018-05-31T18:15:19Z
dc.description.abstract K-mer indices and de Bruijn graphs are important data structures in bioinformatics, with applications ranging from foundational tasks such as error correction, alignment, and genome assembly to knowledge discovery tasks including repeat detection and SNP identification. While advances in next generation sequencing technologies have dramatically reduced cost and improved latency and throughput, few bioinformatics tools can efficiently process data sets at the current generation rate of 1.8 terabases every 3 days. The volume and velocity with which sequencing data is generated necessitate efficient algorithms and implementations of k-mer indices and de Bruijn graphs, two central components in bioinformatics applications. Existing applications that utilize k-mer counting and de Bruijn graphs, however, tend to provide embedded, specialized implementations. The research presented here represents efforts toward the creation of the first reusable, flexible, and extensible distributed memory parallel libraries for k-mer indexing and de Bruijn graphs. These libraries are intended to simplify the development of bioinformatics applications for distributed memory environments. For each library, our goals are to create a set of APIs that are simple to use and to provide optimized implementations based on efficient parallel algorithms. We designed algorithms that minimize communication volume and latency, and developed implementations with better cache utilization and SIMD vectorization. We developed Kmerind, a k-mer counting and indexing library based on distributed memory hash tables and distributed sorted arrays, that provides efficient insert, find, count, and erase operations. For de Bruijn graphs, we developed Bruno by leveraging Kmerind functionality to support parallel de Bruijn graph construction, chain compaction, error removal, graph traversal, and element query.
Our performance evaluations showed that Kmerind is scalable and high-performing. Kmerind counted k-mers in a 120 GB data set in less than 13 seconds on 1024 cores, and indexed the k-mer positions in 17 seconds. Using the Cori supercomputer and incorporating architecture-aware optimizations as well as MPI-OpenMP hybrid computation and overlapped communication, Kmerind was able to count a 350 GB data set in 4.1 seconds using 4096 cores. Kmerind has been shown to outperform the state-of-the-art k-mer counting tools at 32 to 64 cores on a shared memory system. The Bruno library is built on Kmerind and implements efficient algorithms for construction, compaction, and error removal. It is capable of constructing, compacting, and generating unitigs for a 694 GB human read data set in 7.3 seconds on 7680 Edison cores. It is 1.4X and 3.7X faster than its state-of-the-art alternatives in shared and distributed memory environments, respectively. Error removal in a graph constructed from a 162 GB data set completed in 13.1 and 3.91 seconds with frequency filters of 2 and 4, respectively, on 16 nodes totaling 512 cores. While our target domain is bioinformatics, we approached algorithm design and implementation with the aim of broader applicability in computer science and other application domains. As a result, our chain compaction and cycle detection algorithms can feasibly be applied to general graphs, and our distributed and sequential cache-friendly hash tables as well as vectorized hash functions are generic and application neutral.
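As a rough illustration of the k-mer counting idea summarized in the abstract, the sketch below counts canonical k-mers and assigns each one to a single owning partition, simulating in one process how a distributed hash table might shard its keys. This is not Kmerind's API or algorithm: the function names (`canonical`, `owner_rank`, `count_kmers`, `find_count`), the modulo partitioning rule, and the choice to skip k-mers containing ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def owner_rank(kmer: str, num_ranks: int) -> int:
    """Hypothetical partitioning rule: map a k-mer to the rank that stores it.

    The k-mer is 2-bit encoded (A=0, C=1, G=2, T=3) so the mapping is
    deterministic across runs. A real library would use a stronger hash.
    """
    code = 0
    for base in kmer:
        code = code * 4 + "ACGT".index(base)
    return code % num_ranks


def count_kmers(reads, k, num_ranks):
    """Count canonical k-mers, keeping one Counter per simulated rank."""
    tables = [Counter() for _ in range(num_ranks)]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Assumption: skip k-mers with ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                key = canonical(kmer)
                tables[owner_rank(key, num_ranks)][key] += 1
    return tables


def find_count(tables, kmer):
    """Look up a k-mer's count by querying only the rank that owns it."""
    key = canonical(kmer)
    return tables[owner_rank(key, len(tables))][key]
```

In a true distributed memory setting, each rank would hold only its own table and k-mers would be exchanged between ranks (for example with an all-to-all communication step) before insertion; here the list of `Counter` objects stands in for that layout.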
dc.description.degree Ph.D.
dc.format.mimetype application/pdf
dc.identifier.uri http://hdl.handle.net/1853/59894
dc.language.iso en_US
dc.publisher Georgia Institute of Technology
dc.subject High performance computing
dc.subject Bioinformatics
dc.subject K-mer index
dc.subject K-mer counting
dc.subject De Bruijn graph
dc.subject Next generation sequencing
dc.subject Parallel algorithms
dc.subject Distributed memory
dc.subject Distributed algorithms
dc.subject SIMD vectorization
dc.subject Cache friendly algorithms
dc.subject MPI
dc.title Distributed memory building blocks for massive biological sequence analysis
dc.type Text
dc.type.genre Dissertation
dspace.entity.type Publication
local.contributor.corporatename College of Computing
local.contributor.corporatename School of Computational Science and Engineering
relation.isOrgUnitOfPublication c8892b3c-8db6-4b7b-a33a-1b67f7db2021
relation.isOrgUnitOfPublication 01ab2ef1-c6da-49c9-be98-fbd1d840d2b1
thesis.degree.level Doctoral
Files
Original bundle
Name: PAN-DISSERTATION-2018.pdf
Size: 4.53 MB
Format: Adobe Portable Document Format