Scalable mapping-free algorithms and K-mer data structures for genomic analysis

Audano, Peter Anthony

dc.contributor.advisor	Vannberg, Fredrik O.
dc.contributor.author	Audano, Peter Anthony
dc.contributor.committeeMember	Jordan, Irving K.
dc.contributor.committeeMember	Aluru, Srinivas
dc.contributor.committeeMember	Hammer, Brian
dc.contributor.committeeMember	Gibson, Gregory
dc.contributor.department	Biology
dc.date.accessioned	2017-08-17T18:56:51Z
dc.date.available	2017-08-17T18:56:51Z
dc.date.created	2016-08
dc.date.issued	2016-07-19
dc.date.submitted	August 2016
dc.date.updated	2017-08-17T18:56:51Z
dc.description.abstract	An organism’s DNA sequence is a virtual cornucopia of information, and sequencing technology is the key to unlocking it. The past few decades have been witness to meteoric growth of these technologies and the volume of data they produce. New industries are forming around it, existing ones are changing as a result of it, and modern medicine is on the precipice of a genomic revolution. Turning this deluge of data into useful information is already a challenging task, with advances in sequencing technology far outstripping advances in semiconductors, and this trend shows no signs of stopping. Incremental advances are being made in sequence analysis techniques, but progress is far too slow to keep up with the volume of data delivered by modern sequencing platforms. This gap is often filled by allocating more computing resources in the form of distributed computing platforms, and this can quickly become prohibitive. Because medicine requires a quick answer and science often has limited funding, the analysis bottleneck is a major concern. Instead of finding new ways to dedicate more computing resources to the problem, I am interested in streamlining the process. In this dissertation, I explore methods to make the analysis faster and more efficient. My ultimate goal is to create algorithms that can run on a standard computer workstation and, when necessary, make the most of expensive distributed and cloud computing resources. Many analysis pipelines start by aligning sequence reads to a reference or assembling them into long consensus sequences, but it can take several hours to analyze a single sample on a workstation computer. Instead of aligning or assembling sequence reads, the approaches described in this dissertation transform and analyze the sequence data without alignments. These alignment-free approaches often improve performance by an order of magnitude or more.
The first step for many alignment-free approaches involves transforming the sequence reads into k-mers, which are short overlapping fragments of uniform size. If the size, k, is 48, then all substrings of length 48 are extracted from each sequence read and counted. The resulting frequency of each k-mer can then be used as evidence for genomic analysis techniques. When I started as a graduate student, I began by working on a program to transform sequence data to k-mer frequencies. The resulting software, KAnalyze, was incredibly flexible because of its software architecture and approach to solving the k-mer counting problem. This became the foundation for the rest of my graduate work because it was possible to test new ideas that required data structures and transformations not commonly used today. After my initial work with k-mers, I was connected with the Centers for Disease Control and Prevention (CDC) Mycobacterium tuberculosis (M. tuberculosis) science team. They were interested in replacing existing software with an alignment-free approach based on k-mers. The new software significantly reduced analysis time, and it reduced errors. K-mers are more rigid than sequence alignments, so I had to find a way to correct for mutations in the samples. This resulted in a novel algorithm that could identify single nucleotide polymorphism (SNP) and insertion/deletion (indel) mutations, and it still took far less time than the alignment approach. I am not aware of another k-mer error correction algorithm that goes beyond a simple Hamming distance calculation and is therefore capable of handling indel variants. The CDC M. tuberculosis project solved this problem in a naïve way; when a variant was detected, it took 4 paths and assumed that it might be a SNP, insertion, deletion, or no variant. The result is an algorithm that runs with a computational time complexity of O(4^n).
Although this worked well on the short reference sequences analyzed for this project, it would never scale to larger references. After some thought and experimentation, I realized that the reconstruction algorithm I had created was generating a sequence that was ostensibly related to the reference, and comparing two sequences to determine how well they match is the fundamental task of alignments. Instead of taking multiple paths for each base mismatch, could it be possible to guide the rebuilding algorithm by aligning the dynamically constructed sequence with the reference? Although it is not 100% alignment-free, this approach would significantly reduce the computational burden of modern methods, which align each of the sequence reads to the reference. Even more exciting, it might be able to pick out variants that are invisible to sequence read alignments, such as dense SNP loci or large insertions. This work culminated in Kestrel, which is a novel first-in-class variant caller application that uses this idea. The chapters that follow tell this story from the current state of the art to novel applications of this new technology that are under development today.
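The k-mer transformation described in the abstract (extract every substring of length k from each read and count it) can be illustrated with a minimal Python sketch. This is not KAnalyze's implementation: the function name count_kmers and the toy reads are hypothetical, k is set to 4 instead of 48 to keep the example small, and details a real counter would need (reverse-complement handling, ambiguous bases, memory limits) are left out.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping substring of length k across all reads.

    Hypothetical helper for illustration only; it keeps all counts in memory.
    """
    counts = Counter()
    for read in reads:
        # A read of length n yields n - k + 1 k-mers; reads shorter than k yield none.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy input: two short reads that overlap, so some k-mers are seen twice.
reads = ["ACGTACGT", "CGTACG"]
kmer_counts = count_kmers(reads, k=4)
```

The resulting frequencies are the kind of evidence the abstract refers to: a k-mer seen in many reads is well supported, while one seen rarely may reflect a sequencing error or a low-frequency variant.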
dc.description.degree	Ph.D.
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/1853/58591
dc.language.iso	en_US
dc.publisher	Georgia Institute of Technology
dc.subject	K-mer
dc.subject	Sequence analysis
dc.subject	Alignment-free
dc.subject	Mapping-free
dc.subject	Variant calling
dc.subject	Kestrel
dc.title	Scalable mapping-free algorithms and K-mer data structures for genomic analysis
dc.type	Text
dc.type.genre	Dissertation
dspace.entity.type	Publication
local.contributor.corporatename	College of Sciences
local.contributor.corporatename	School of Biological Sciences
relation.isOrgUnitOfPublication	85042be6-2d68-4e07-b384-e1f908fae48a
relation.isOrgUnitOfPublication	c8b3bd08-9989-40d3-afe3-e0ad8d5c72b5
thesis.degree.level	Doctoral