Long read mapping at scale: Algorithms and applications

Jain, Chirag

dc.contributor.advisor	Aluru, Srinivas
dc.contributor.author	Jain, Chirag
dc.contributor.committeeMember	Konstantinidis, Konstantinos T.
dc.contributor.committeeMember	Catalyurek, Umit
dc.contributor.committeeMember	Phillippy, Adam M.
dc.contributor.committeeMember	Jordan, King
dc.contributor.department	Computational Science and Engineering
dc.date.accessioned	2019-05-29T14:03:13Z
dc.date.available	2019-05-29T14:03:13Z
dc.date.created	2019-05
dc.date.issued	2019-04-01
dc.date.submitted	May 2019
dc.date.updated	2019-05-29T14:03:13Z
dc.description.abstract	The capability to sequence DNA has been around for four decades now, providing ample time to explore its myriad applications and the concomitant development of bioinformatics methods to support them. Nevertheless, disruptive technological changes in sequencing often upend prevailing protocols and characteristics of what can be sequenced, necessitating a new direction of development for bioinformatics algorithms and software. We are now at the cusp of the next revolution in sequencing due to the development of long and ultra-long read sequencing technologies by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Long reads are attractive because they narrow the scale gap between sizes of genomes and sizes of sequenced reads, with the promise of avoiding assembly errors and repeat resolution challenges that plague short read assemblers. However, long reads themselves sport error rates in the vicinity of 10-15%, compared to the low error rates of short reads (< 1%). There is an urgent need to develop bioinformatics methods to fully realize the potential of long-read sequencers. Mapping and alignment of reads to a reference is typically the first step in genomics applications. Though long read technologies are still evolving, research efforts in bioinformatics have already produced many alignment-based and alignment-free read mapping algorithms. Yet, much work lies ahead in designing provably efficient algorithms, formally characterizing the quality of results, and developing methods that scale to larger input datasets and growing reference databases. While the current model to represent the reference as a collection of linear genomes is still favored due to its simplicity, mapping to graph-based representations, where the graph encodes genetic variations in a human population, also becomes imperative. This dissertation work is focused on provably good and scalable algorithms for mapping long reads to both linear and graph references.
We make the following contributions: 1. We develop fast and approximate algorithms for end-to-end and split mapping of long reads to reference genomes. Our work is the first to demonstrate scaling to the entire NCBI database, the collection of all curated and non-redundant genomes. 2. We generalize the mapping algorithm to accelerate the related problem of computing pairwise whole-genome comparisons. We shed light on two fundamental biological questions concerning genomic duplications and delineating microbial species boundaries. 3. We provide new complexity results for aligning reads to graphs under Hamming and edit distance models to classify the problem variants for which existence of a polynomial time solution is unlikely. In contrast to prior results that assume alphabets whose size is a function of the problem size, we prove that the problem variants that allow edits in the graph remain NP-complete even for constant-sized alphabets, thereby resolving the computational complexity of the problem for DNA and protein sequence to graph alignments. 4. Finally, we propose a new parallel algorithm to optimally align long reads to large variation graphs derived from human genomes. It demonstrates near-linear scaling on multi-core CPUs, resulting in run-time reduction from multiple days to three hours when aligning a long read set to an MHC human variation graph.
dc.description.degree	Ph.D.
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/1853/61258
dc.language.iso	en_US
dc.publisher	Georgia Institute of Technology
dc.subject	Long reads
dc.subject	Alignment
dc.subject	Sequence mapping
dc.subject	Variation graphs
dc.subject	Genomics
dc.title	Long read mapping at scale: Algorithms and applications
dc.type	Text
dc.type.genre	Dissertation
dspace.entity.type	Publication
local.contributor.advisor	Aluru, Srinivas
local.contributor.corporatename	College of Computing
local.contributor.corporatename	School of Computational Science and Engineering
relation.isAdvisorOfPublication	da8266a7-bec4-435e-a6b8-2a4249e85863
relation.isOrgUnitOfPublication	c8892b3c-8db6-4b7b-a33a-1b67f7db2021
relation.isOrgUnitOfPublication	01ab2ef1-c6da-49c9-be98-fbd1d840d2b1
thesis.degree.level	Doctoral