Title: Scalable mapping-free algorithms and K-mer data structures for genomic analysis

dc.contributor.advisor Vannberg, Fredrik O.
dc.contributor.author Audano, Peter Anthony
dc.contributor.committeeMember Jordan, Irving K.
dc.contributor.committeeMember Aluru, Srinivas
dc.contributor.committeeMember Hammer, Brian
dc.contributor.committeeMember Gibson, Gregory
dc.contributor.department Biology
dc.date.accessioned 2017-08-17T18:56:51Z
dc.date.available 2017-08-17T18:56:51Z
dc.date.created 2016-08
dc.date.issued 2016-07-19
dc.date.submitted August 2016
dc.date.updated 2017-08-17T18:56:51Z
dc.description.abstract An organism’s DNA sequence is a virtual cornucopia of information, and sequencing technology is the key to unlocking it. The past few decades have been witness to meteoric growth of these technologies and the volume of data they produce. New industries are forming around it, existing ones are changing as a result of it, and modern medicine is on the precipice of a genomic revolution. Turning this deluge of data into useful information is already a challenging task with advances in sequencing technology far outstripping advances in semiconductors, and this trend shows no signs of stopping. Incremental advances are being made in sequence analysis techniques, but progress is far too slow to keep up with the volume of data delivered by modern sequencing platforms. This gap is often filled by allocating more computing resources in the form of distributed computing platforms, and this can quickly become prohibitive. Because medicine requires a quick answer and science often has limited funding, the analysis bottleneck is a major concern. Instead of finding new ways to dedicate more computing resources to the problem, I am interested in streamlining the process. In this dissertation, I explore methods to make the analysis faster and more efficient. My ultimate goal is to create algorithms that can run on a standard computer workstation, and when necessary, make the most of expensive distributed and cloud computing resources. Many analysis pipelines start by aligning sequence reads to a reference or assembling them into long consensus sequences, but it can take several hours to analyze a single sample on a workstation computer. Instead of aligning or assembling sequence reads, the approaches described in this dissertation transform and analyze the sequence data without alignments. These alignment-free approaches often improve performance by an order of magnitude or more.
The first step for many alignment-free approaches involves transforming the sequence reads into k-mers, which are short overlapping fragments of uniform size. If the size, k, is 48, then all substrings of length 48 are extracted from each sequence read and counted. The resulting frequency of each k-mer can then be used as evidence for genomic analysis techniques. When I started as a graduate student, I began by working on a program to transform sequence data to k-mer frequencies. The resulting software, KAnalyze, was incredibly flexible because of its software architecture and approach to solving the k-mer counting problem. This became the foundation for the rest of my graduate work because it was possible to test new ideas that required data structures and transformations not commonly used today. After my initial work with k-mers, I was connected with the Centers for Disease Control and Prevention (CDC) Mycobacterium tuberculosis (M. tuberculosis) science team. They were interested in replacing existing software with an alignment-free approach based on k-mers. The new software significantly reduced analysis time, and it reduced errors. K-mers are more rigid than sequence alignments, so I had to find a way to correct for mutations in the samples. This resulted in a novel algorithm that could identify single nucleotide polymorphism (SNP) and insertion/deletion (indel) mutations, and it still took far less time than the alignment approach. I am not aware of a k-mer error correction algorithm that goes beyond a simple Hamming distance calculation and is therefore capable of handling indel variants. The CDC M. tuberculosis project solved this problem in a naïve way; when a variant was detected, it took 4 paths and assumed that it might be a SNP, insertion, deletion, or no variant. The result is an algorithm that runs with a computational time complexity of O(4^n).
Although this worked well on the short reference sequences analyzed for this project, it would never scale to larger references. After some thought and experimentation, I realized that the reconstruction algorithm I had created was generating a sequence that was ostensibly related to the reference, and comparing two sequences to determine how well they match is the fundamental task of alignments. Instead of taking multiple paths for each base mismatch, could it be possible to guide the rebuilding algorithm by aligning the dynamically constructed sequence with the reference? Although it is not 100% alignment-free, this approach would significantly reduce the computational burden of modern methods, which align each of the sequence reads to the reference. What was more exciting is that it might be able to pick out variants blind to sequence read alignments, such as dense SNP loci or large insertions. This work culminated in Kestrel, which is a novel first-in-class variant caller application that uses this idea. The chapters that follow tell this story from the current state of the art to novel applications of this new technology that are under development today.
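The k-mer transformation described in the abstract (extract every overlapping substring of length k from each read and count it) can be illustrated with a minimal Python sketch. This is not KAnalyze's implementation. The function name `count_kmers`, the toy reads, and the choice of k = 4 are assumptions made only for illustration, and the sketch ignores practical concerns such as reverse complements, ambiguous bases, and memory limits.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping substring of length k across all reads.

    Hypothetical helper for illustration; not taken from KAnalyze.
    """
    counts = Counter()
    for read in reads:
        # A read of length n yields n - k + 1 k-mers; reads shorter than k yield none.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy input: two short overlapping reads (real reads are far longer, and the
# abstract's example uses k = 48).
reads = ["GATTACA", "ATTACAG"]
kmer_counts = count_kmers(reads, k=4)

for kmer, frequency in sorted(kmer_counts.items()):
    print(kmer, frequency)
```

K-mers shared by both reads (here `ATTA`, `TTAC`, and `TACA`) receive a count of 2, which is the kind of frequency evidence the abstract refers to.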
dc.description.degree Ph.D.
dc.format.mimetype application/pdf
dc.identifier.uri http://hdl.handle.net/1853/58591
dc.language.iso en_US
dc.publisher Georgia Institute of Technology
dc.subject K-mer
dc.subject Sequence analysis
dc.subject Alignment-free
dc.subject Mapping-free
dc.subject Variant calling
dc.subject Kestrel
dc.title Scalable mapping-free algorithms and K-mer data structures for genomic analysis
dc.type Text
dc.type.genre Dissertation
dspace.entity.type Publication
local.contributor.corporatename College of Sciences
local.contributor.corporatename School of Biological Sciences
relation.isOrgUnitOfPublication 85042be6-2d68-4e07-b384-e1f908fae48a
relation.isOrgUnitOfPublication c8b3bd08-9989-40d3-afe3-e0ad8d5c72b5
thesis.degree.level Doctoral
Files
Original bundle
Name: AUDANO-DISSERTATION-2016.pdf
Size: 3.77 MB
Format: Adobe Portable Document Format