Title:
Long read mapping at scale: Algorithms and applications

dc.contributor.advisor Aluru, Srinivas
dc.contributor.author Jain, Chirag
dc.contributor.committeeMember Konstantinidis, Konstantinos T.
dc.contributor.committeeMember Catalyurek, Umit
dc.contributor.committeeMember Phillippy, Adam M.
dc.contributor.committeeMember Jordan, King
dc.contributor.department Computational Science and Engineering
dc.date.accessioned 2019-05-29T14:03:13Z
dc.date.available 2019-05-29T14:03:13Z
dc.date.created 2019-05
dc.date.issued 2019-04-01
dc.date.submitted May 2019
dc.date.updated 2019-05-29T14:03:13Z
dc.description.abstract The capability to sequence DNA has been around for four decades now, providing ample time to explore its myriad applications and the concomitant development of bioinformatics methods to support them. Nevertheless, disruptive technological changes in sequencing often upend prevailing protocols and characteristics of what can be sequenced, necessitating a new direction of development for bioinformatics algorithms and software. We are now at the cusp of the next revolution in sequencing due to the development of long and ultra-long read sequencing technologies by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Long reads are attractive because they narrow the scale gap between sizes of genomes and sizes of sequenced reads, with the promise of avoiding assembly errors and repeat resolution challenges that plague short read assemblers. However, long reads themselves sport error rates in the vicinity of 10-15%, compared to the high accuracy of short reads (< 1%). There is an urgent need to develop bioinformatics methods to fully realize the potential of long-read sequencers. Mapping and alignment of reads to a reference is typically the first step in genomics applications. Though long read technologies are still evolving, research efforts in bioinformatics have already produced many alignment-based and alignment-free read mapping algorithms. Yet, much work lies ahead in designing provably efficient algorithms, formally characterizing the quality of results, and developing methods that scale to larger input datasets and growing reference databases. While the current model of representing the reference as a collection of linear genomes is still favored due to its simplicity, mapping to graph-based representations, where the graph encodes genetic variations in a human population, also becomes imperative. This dissertation work is focused on provably good and scalable algorithms for mapping long reads to both linear and graph references.
We make the following contributions: 1. We develop fast and approximate algorithms for end-to-end and split mapping of long reads to reference genomes. Our work is the first to demonstrate scaling to the entire NCBI database, the collection of all curated and non-redundant genomes. 2. We generalize the mapping algorithm to accelerate the related problem of computing pairwise whole-genome comparisons. We shed light on two fundamental biological questions concerning genomic duplications and delineating microbial species boundaries. 3. We provide new complexity results for aligning reads to graphs under Hamming and edit distance models to classify the problem variants for which existence of a polynomial time solution is unlikely. In contrast to prior results that assume the alphabet size is a function of the problem size, we prove that the problem variants that allow edits in the graph remain NP-complete even for constant-sized alphabets, thereby resolving the computational complexity of the problem for DNA and protein sequence to graph alignments. 4. Finally, we propose a new parallel algorithm to optimally align long reads to large variation graphs derived from human genomes. It demonstrates near-linear scaling on multi-core CPUs, resulting in run-time reduction from multiple days to three hours when aligning a long read set to an MHC human variation graph.
dc.description.degree Ph.D.
dc.format.mimetype application/pdf
dc.identifier.uri http://hdl.handle.net/1853/61258
dc.language.iso en_US
dc.publisher Georgia Institute of Technology
dc.subject Long reads
dc.subject Alignment
dc.subject Sequence mapping
dc.subject Variation graphs
dc.subject Genomics
dc.title Long read mapping at scale: Algorithms and applications
dc.type Text
dc.type.genre Dissertation
dspace.entity.type Publication
local.contributor.advisor Aluru, Srinivas
local.contributor.corporatename College of Computing
local.contributor.corporatename School of Computational Science and Engineering
relation.isAdvisorOfPublication da8266a7-bec4-435e-a6b8-2a4249e85863
relation.isOrgUnitOfPublication c8892b3c-8db6-4b7b-a33a-1b67f7db2021
relation.isOrgUnitOfPublication 01ab2ef1-c6da-49c9-be98-fbd1d840d2b1
thesis.degree.level Doctoral
Files
Original bundle
Name:
JAIN-DISSERTATION-2019.pdf
Size:
7.31 MB
Format:
Adobe Portable Document Format
License bundle
Name:
LICENSE.txt
Size:
3.86 KB
Format:
Plain Text